Hessian Aided Policy Gradient

Z. Shen¹, A. Ribeiro², H. Hassani², H. Qian¹, C. Mi¹

¹ Department of Computer Science and Technology, Zhejiang University
² Department of Electrical and Systems Engineering, University of Pennsylvania

International Conference on Machine Learning, 2019
Outline

1. Motivation
   - Reinforcement Learning via Policy Optimization
   - Variance Reduction for Oblivious Optimization

2. Our Results/Contribution
   - Variance Reduction for Non-oblivious Optimization
   - Unbiased Policy Hessian Estimator
Policy Optimization as Stochastic Maximization

$$\max_{\theta \in \mathbb{R}^d} J(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

MDP $\stackrel{\text{def}}{=} (\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, with $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$;
Policy: $\pi_\theta(\cdot \mid s) : \mathcal{A} \to [0,1]$ for all $s \in \mathcal{S}$;
Trajectory: $\tau \stackrel{\text{def}}{=} (s_0, a_0, \ldots, a_{H-1}, s_H) \sim \pi_\theta$, where $a_h \sim \pi_\theta(\cdot \mid s_h)$, $s_{h+1} \sim P(\cdot \mid s_h, a_h)$, $s_0 \sim \rho_0(\cdot)$.

Probability and discounted cumulative reward of a trajectory:

$$p(\tau) \stackrel{\text{def}}{=} \rho_0(s_0) \prod_{h=0}^{H-1} P(s_{h+1} \mid s_h, a_h)\, \pi_\theta(a_h \mid s_h), \qquad R(\tau) \stackrel{\text{def}}{=} \sum_{h=0}^{H-1} \gamma^h r(s_h, a_h).$$
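To make these objects concrete, here is a minimal Python sketch of sampling a trajectory $\tau$ and its return $R(\tau)$ from a toy tabular MDP with a softmax table policy. The random transition kernel, reward table, horizon, and discount are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Toy tabular MDP (assumed for illustration): |S| states, |A| actions, horizon H, discount gamma.
rng = np.random.default_rng(0)
n_states, n_actions, H, gamma = 5, 3, 20, 0.99

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over next states
r = rng.uniform(size=(n_states, n_actions))                       # r(s, a)
rho0 = np.ones(n_states) / n_states                               # initial-state distribution rho_0

def policy(theta, s):
    """pi_theta(.|s): softmax over the s-th row of the parameter table theta."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_trajectory(theta):
    """Sample tau = (s_0, a_0, ..., a_{H-1}, s_H) ~ pi_theta and return (state-action list, R(tau))."""
    s = rng.choice(n_states, p=rho0)
    traj, R = [], 0.0
    for h in range(H):
        a = rng.choice(n_actions, p=policy(theta, s))
        R += gamma**h * r[s, a]                  # R(tau) = sum_h gamma^h r(s_h, a_h)
        traj.append((s, a))
        s = rng.choice(n_states, p=P[s, a])      # s_{h+1} ~ P(.|s_h, a_h)
    return traj, R

theta = np.zeros((n_states, n_actions))
tau, R_tau = sample_trajectory(theta)
print(R_tau)
```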
Policy Optimization with REINFORCE

$$\max_{\theta \in \mathbb{R}^d} J(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

Non-oblivious: $p(\tau)$ depends on $\theta$.

REINFORCE (SGD)

$$\theta_{t+1} := \theta_t + \eta\, g(\theta_t; \mathcal{S}_\tau)$$

finds $\|\nabla J(\theta_\epsilon)\| \le \epsilon$ ($\epsilon$-FOSP) using $O(1/\epsilon^4)$ samples of $\tau$, where

$$g(\theta; \mathcal{S}_\tau) \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}_\tau|} \sum_{\tau \in \mathcal{S}_\tau} R(\tau)\, \nabla \log p(\tau; \pi_\theta), \qquad \tau \in \mathcal{S}_\tau \sim \pi_\theta.$$
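Continuing the toy-MDP sketch above, one possible rendering of the REINFORCE estimator $g(\theta; \mathcal{S}_\tau)$ and the SGD ascent step is shown below. Note that $\nabla \log p(\tau; \pi_\theta) = \sum_h \nabla \log \pi_\theta(a_h \mid s_h)$ since the dynamics terms do not depend on $\theta$. The batch size and step size $\eta$ are illustrative choices.

```python
def grad_log_policy(theta, s, a):
    """grad_theta log pi_theta(a|s) for the softmax table policy above."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

def grad_log_traj(theta, traj):
    """grad log p(tau; pi_theta) = sum_h grad log pi_theta(a_h|s_h); dynamics terms vanish."""
    return sum(grad_log_policy(theta, s, a) for s, a in traj)

def reinforce_grad(theta, batch_size):
    """g(theta; S_tau) = (1/|S_tau|) sum_tau R(tau) grad log p(tau; pi_theta)."""
    g = np.zeros_like(theta)
    for _ in range(batch_size):
        traj, R = sample_trajectory(theta)
        g += R * grad_log_traj(theta, traj)
    return g / batch_size

# REINFORCE / SGD ascent: theta_{t+1} = theta_t + eta * g(theta_t; S_tau)
eta = 0.1
for t in range(100):
    theta = theta + eta * reinforce_grad(theta, batch_size=16)
```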
Oblivious Stochastic Optimization

$$\min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim p(z)}\big[\tilde{\mathcal{L}}(\theta; z)\big] \qquad (1)$$

Oblivious: $p(z)$ is independent of $\theta$.

Stochastic Gradient Descent (SGD)

$$\theta_{t+1} := \theta_t - \eta\, \nabla \tilde{\mathcal{L}}(\theta_t; \mathcal{S}_z)$$

finds $\|\nabla \mathcal{L}(\theta_\epsilon)\| \le \epsilon$ ($\epsilon$-FOSP) using $O(1/\epsilon^4)$ samples of $z$, where

$$\tilde{\mathcal{L}}(\theta; \mathcal{S}_z) \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}_z|} \sum_{z \in \mathcal{S}_z} \tilde{\mathcal{L}}(\theta; z).$$
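As a self-contained sketch of the oblivious setting, the example below runs SGD on a synthetic objective $\mathcal{L}(\theta) = \mathbb{E}_z[\tfrac{1}{2}\|\theta - z\|^2]$ with $z \sim \mathcal{N}(\mu, I)$, whose minimizer is $\mu$. The objective, dimension, batch size, and step size are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic oblivious problem: grad L~(theta; z) = theta - z, with z ~ N(mu, I) independent of theta.
rng = np.random.default_rng(1)
d, eta = 4, 0.05
mu = np.array([1.0, -2.0, 0.5, 3.0])

def stoch_grad(theta, batch):
    z = rng.normal(loc=mu, scale=1.0, size=(batch, d))   # z ~ p(z), independent of theta
    return (theta - z).mean(axis=0)                      # (1/|S_z|) sum_z grad L~(theta; z)

theta = np.zeros(d)
for t in range(2000):
    theta = theta - eta * stoch_grad(theta, batch=8)     # theta_{t+1} = theta_t - eta * grad
print(theta)  # approaches mu up to gradient noise
```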
Variance Reduction: Oblivious Case

$$\min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim p(z)}\big[\tilde{\mathcal{L}}(\theta; z)\big] \qquad (2)$$

Oblivious: $p(z)$ is independent of $\theta$.

SPIDER

$$\Delta_t \stackrel{\text{def}}{=} \nabla \tilde{\mathcal{L}}(\theta_t; \mathcal{S}_z) - \nabla \tilde{\mathcal{L}}(\theta_{t-1}; \mathcal{S}_z), \qquad \mathbb{E}_{\mathcal{S}_z}[\Delta_t] = \nabla \mathcal{L}(\theta_t) - \nabla \mathcal{L}(\theta_{t-1})$$

$$g_t := g_{t-1} + \Delta_t, \qquad \theta_{t+1} := \theta_t - \eta\, g_t, \qquad \big(\mathbb{E}[g_t] = \nabla \mathcal{L}(\theta_t)\big)$$

finds $\|\nabla \mathcal{L}(\theta_\epsilon)\| \le \epsilon$ using $O(1/\epsilon^3)$ samples of $z$, where

$$\tilde{\mathcal{L}}(\theta; \mathcal{S}_z) \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}_z|} \sum_{z \in \mathcal{S}_z} \tilde{\mathcal{L}}(\theta; z).$$
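A rough SPIDER-style loop on the same synthetic objective (reusing `mu`, `d`, `eta`, and `rng` from the previous sketch): the key point is that $\Delta_t$ evaluates the stochastic gradient at both $\theta_t$ and $\theta_{t-1}$ on the *same* minibatch $\mathcal{S}_z$, and the estimator is periodically refreshed with a larger batch. The batch sizes and restart period are illustrative assumptions.

```python
def minibatch_grad(theta, Z):
    """grad L~(theta; S_z) on a fixed minibatch Z (the same samples are reused at two points)."""
    return (theta - Z).mean(axis=0)

theta_prev = np.zeros(d)
g = minibatch_grad(theta_prev, rng.normal(mu, 1.0, size=(64, d)))      # large-batch anchor gradient g_0
theta = theta_prev - eta * g
for t in range(1, 2000):
    if t % 50 == 0:                                                    # periodic restart with a fresh large batch
        g = minibatch_grad(theta, rng.normal(mu, 1.0, size=(64, d)))
    else:
        Z = rng.normal(mu, 1.0, size=(8, d))                           # one shared minibatch S_z
        g = g + minibatch_grad(theta, Z) - minibatch_grad(theta_prev, Z)   # g_t = g_{t-1} + Delta_t
    theta_prev, theta = theta, theta - eta * g
print(theta)
```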
Variance Reduction: Non-oblivious Case?

$$\max_{\theta \in \mathbb{R}^d} J(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \qquad (3)$$

Non-oblivious: $p(\tau)$ depends on $\theta$.

SPIDER

$$\Delta_t \stackrel{\text{def}}{=} g(\theta_t; \mathcal{S}_\tau) - g(\theta_{t-1}; \mathcal{S}_\tau), \qquad \tau \in \mathcal{S}_\tau \sim \pi_{\theta_t}, \qquad \mathbb{E}_{\mathcal{S}_\tau}[\Delta_t] \neq \nabla J(\theta_t) - \nabla J(\theta_{t-1})$$

$$g_t := g_{t-1} + \Delta_t, \qquad \theta_{t+1} := \theta_t + \eta\, g_t, \qquad \big(\mathbb{E}[g_t] \neq \nabla J(\theta_t)\big)$$

The trajectories in $\mathcal{S}_\tau$ are drawn from $\pi_{\theta_t}$, so $g(\theta_{t-1}; \mathcal{S}_\tau)$ is a biased estimate of $\nabla J(\theta_{t-1})$ and the SPIDER correction loses its unbiasedness. Here

$$g(\theta; \mathcal{S}_\tau) \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}_\tau|} \sum_{\tau \in \mathcal{S}_\tau} R(\tau)\, \nabla \log p(\tau; \pi_\theta).$$
Variance Reduction for Non-oblivious Optimization

Goal: an update $\theta_{t+1} := \theta_t + \eta\, g_t$ with $\mathbb{E}[g_t] = \nabla J(\theta_t)$, built recursively as $g_t := g_{t-1} + \Delta_t$ with $\mathbb{E}[\Delta_t] = \nabla J(\theta_t) - \nabla J(\theta_{t-1})$.

Define $\theta_a \stackrel{\text{def}}{=} a\, \theta_t + (1-a)\, \theta_{t-1}$ for $a \in [0,1]$. Then

$$\begin{aligned}
\nabla J(\theta_t) - \nabla J(\theta_{t-1})
&= \int_0^1 \nabla^2 J(\theta_a)\,(\theta_t - \theta_{t-1})\, \mathrm{d}a \\
&= \left( \int_0^1 \nabla^2 J(\theta_a)\, \mathrm{d}a \right) \cdot (\theta_t - \theta_{t-1}) \\
&= \mathbb{E}_{a \sim \mathrm{Uni}([0,1])}\big[\nabla^2 J(\theta_a)\big] \cdot (\theta_t - \theta_{t-1}) \\
&= \mathbb{E}\big[\tilde{\nabla}^2(\theta_a; \tau_a) \cdot (\theta_t - \theta_{t-1})\big], \qquad \big(\mathbb{E}_{\tau_a}[\tilde{\nabla}^2(\theta_a; \tau_a)] = \nabla^2 J(\theta_a)\big).
\end{aligned}$$
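One step left implicit is how unbiasedness of each increment yields $\mathbb{E}[g_t] = \nabla J(\theta_t)$: unrolling the recursion and telescoping, under the assumption that the initial estimate satisfies $\mathbb{E}[g_0] = \nabla J(\theta_0)$ (e.g. a large-batch REINFORCE gradient), gives

```latex
% Telescoping argument for the recursive estimator g_t = g_{t-1} + \Delta_t,
% assuming E[g_0] = \nabla J(\theta_0) and E[\Delta_k] = \nabla J(\theta_k) - \nabla J(\theta_{k-1}).
\begin{align*}
  \mathbb{E}[g_t]
  &= \mathbb{E}[g_0] + \sum_{k=1}^{t} \mathbb{E}[\Delta_k] \\
  &= \nabla J(\theta_0) + \sum_{k=1}^{t} \big(\nabla J(\theta_k) - \nabla J(\theta_{k-1})\big)
   = \nabla J(\theta_t).
\end{align*}
```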
Variance Reduction: Non-oblivious Case!

$$\max_{\theta \in \mathbb{R}^d} J(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \qquad (4)$$

HAPG

$$g_t := g_{t-1} + \tilde{\nabla}^2(\theta_t, \theta_{t-1}; \mathcal{S}_{a,\tau})\,[\theta_t - \theta_{t-1}], \qquad \theta_{t+1} := \theta_t + \eta\, g_t, \qquad \big(\mathbb{E}[g_t] = \nabla J(\theta_t)\big)$$

$$\tilde{\nabla}^2(\theta_t, \theta_{t-1}; \mathcal{S}_{a,\tau}) \stackrel{\text{def}}{=} \frac{1}{|\mathcal{S}_{a,\tau}|} \sum_{(a, \tau_a) \in \mathcal{S}_{a,\tau}} \tilde{\nabla}^2(\theta_a; \tau_a),$$

where $a \sim \mathrm{Uni}([0,1])$, $\tau_a \sim \pi_{\theta_a}$, and $\theta_a \stackrel{\text{def}}{=} a\, \theta_t + (1-a)\, \theta_{t-1}$.

HAPG finds $\|\nabla J(\theta_\epsilon)\| \le \epsilon$ using $O(1/\epsilon^3)$ samples of $\tau$.
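Continuing the toy-MDP sketch, here is a rough rendering of a HAPG-style iteration: a large-batch REINFORCE anchor gradient, followed by Hessian-aided corrections $\tilde{\nabla}^2(\theta_a; \tau_a)[\theta_t - \theta_{t-1}]$ applied as Hessian-vector products. The restart period, batch sizes, step size, and the closed-form softmax-policy derivatives are illustrative assumptions, not the authors' experimental setup.

```python
def hess_log_policy_vec(theta, s, a, v):
    """(grad^2_theta log pi_theta(a|s)) @ v for the softmax table policy; only row s is nonzero,
    and the row-s block equals -(diag(pi) - pi pi^T)."""
    pi = policy(theta, s)
    out = np.zeros_like(theta)
    out[s] = -(pi * v[s] - pi * np.dot(pi, v[s]))
    return out

def hvp_estimate(theta_t, theta_prev, v, batch):
    """Monte Carlo estimate of (int_0^1 grad^2 J(theta_a) da) @ v, theta_a = a*theta_t + (1-a)*theta_prev."""
    est = np.zeros_like(theta_t)
    for _ in range(batch):
        a = rng.uniform()                                  # a ~ Uni([0, 1])
        theta_a = a * theta_t + (1.0 - a) * theta_prev
        traj, R = sample_trajectory(theta_a)               # tau_a ~ pi_{theta_a}
        glog = grad_log_traj(theta_a, traj)
        hlog_v = sum(hess_log_policy_vec(theta_a, s, act, v) for s, act in traj)
        # tilde-grad^2(theta_a; tau_a) @ v = R(tau) { (grad log p)(grad log p)^T + grad^2 log p } @ v
        est += R * (glog * np.dot(glog.ravel(), v.ravel()) + hlog_v)
    return est / batch

# HAPG-style loop: anchor gradient via REINFORCE, then Hessian-aided corrections.
eta = 0.1
theta_prev = np.zeros((n_states, n_actions))
g = reinforce_grad(theta_prev, batch_size=64)              # g_0: large-batch anchor
theta = theta_prev + eta * g
for t in range(1, 200):
    if t % 20 == 0:                                        # periodic restart with a fresh anchor
        g = reinforce_grad(theta, batch_size=64)
    else:
        g = g + hvp_estimate(theta, theta_prev, theta - theta_prev, batch=8)
    theta_prev, theta = theta, theta + eta * g             # ascent step theta_{t+1} = theta_t + eta g_t
```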
Unbiased Policy Hessian Estimator

$$\nabla J(\theta) = \int_\tau R(\tau)\, \nabla p(\tau; \pi_\theta)\, \mathrm{d}\tau = \int_\tau p(\tau; \pi_\theta) \cdot \big[R(\tau)\, \nabla \log p(\tau; \pi_\theta)\big]\, \mathrm{d}\tau$$

$$\begin{aligned}
\nabla^2 J(\theta)
&= \int_\tau R(\tau)\, \nabla p(\tau; \pi_\theta)\, [\nabla \log p(\tau; \pi_\theta)]^\top + p(\tau; \pi_\theta) \cdot \big[R(\tau)\, \nabla^2 \log p(\tau; \pi_\theta)\big]\, \mathrm{d}\tau \\
&= \int_\tau R(\tau)\, p(\tau; \pi_\theta)\, \big\{ \nabla \log p(\tau; \pi_\theta)\, [\nabla \log p(\tau; \pi_\theta)]^\top + \nabla^2 \log p(\tau; \pi_\theta) \big\}\, \mathrm{d}\tau
\end{aligned}$$

$$\tilde{\nabla}^2(\theta; \tau) \stackrel{\text{def}}{=} R(\tau)\, \big\{ \nabla \log p(\tau; \pi_\theta)\, [\nabla \log p(\tau; \pi_\theta)]^\top + \nabla^2 \log p(\tau; \pi_\theta) \big\}, \qquad \tau \sim \pi_\theta.$$
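For the tabular softmax policy of the earlier sketch, $\tilde{\nabla}^2(\theta; \tau)$ can be formed explicitly, since $\nabla^2 \log p(\tau; \pi_\theta) = \sum_h \nabla^2 \log \pi_\theta(a_h \mid s_h)$ and each term only touches the parameter row of $s_h$. The helper below is a single-trajectory sketch reusing `policy`, `grad_log_traj`, `sample_trajectory`, and `n_actions` from above; it is an illustration, not the paper's implementation.

```python
def hess_log_policy(theta, s, a):
    """grad^2_theta log pi_theta(a|s) as a (d, d) matrix over flattened parameters;
    only the (s, s) block is nonzero and equals -(diag(pi) - pi pi^T)."""
    d = theta.size
    H_mat = np.zeros((d, d))
    pi = policy(theta, s)
    idx = slice(s * n_actions, (s + 1) * n_actions)
    H_mat[idx, idx] = -(np.diag(pi) - np.outer(pi, pi))
    return H_mat

def hessian_estimate(theta, traj, R):
    """tilde-grad^2(theta; tau) = R(tau) { grad log p (grad log p)^T + grad^2 log p } for one trajectory."""
    glog = grad_log_traj(theta, traj).ravel()
    hlog = sum(hess_log_policy(theta, s, a) for s, a in traj)
    return R * (np.outer(glog, glog) + hlog)

traj, R = sample_trajectory(theta)
H_hat = hessian_estimate(theta, traj, R)   # one-sample unbiased estimate of grad^2 J(theta)
print(np.allclose(H_hat, H_hat.T))         # the estimator is symmetric by construction
```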
Summary

First method that provably reduces the sample complexity of reaching an $\epsilon$-FOSP of the RL objective from $O(1/\epsilon^4)$ to $O(1/\epsilon^3)$.