Neural Fitted Actor-Critic
Matthieu Zimmer, Alain Dutech, Yann Boniface
University of Lorraine, LORIA
8th July 2016
Outline
1. Background
2. Neural Fitted Actor-Critic
3. Future works
Reinforcement Learning
Reinforcement Learning
Optimization problem: find a policy $\pi : S \to A$ that maximizes the expected discounted return
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right].$$
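A tiny numerical illustration of this objective (the discount factor and reward sequence below are arbitrary illustrative values, not taken from the slides):

```python
# Discounted return of a single episode; gamma and the rewards are
# illustrative values only.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.9**2 * 1 + 0.9**4 * 5 ≈ 4.09
```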
Constraints and Motivations
Reinforcement learning + developmental robotics:
1. Continuous environments
2. No prior models of agent or environment
3. Non-linear approximators (neural networks)
4. No prior goal states or trajectories
How to solve reinforcement learning problems?
- Actor-only: learn $\pi : S \to A$ directly; play $\pi_k$ to collect $\sum_{t=0}^{\infty} \gamma^t r_t$, then update $\pi_k \to \pi_{k+1}$.
- Critic-only: learn $Q : S \times A \to \mathbb{R}$; play, update $Q_k \to Q_{k+1}$ from the collected returns, then deduce the policy $\pi_k$ from $Q_k$.
- Actor-Critic: learn both $\pi : S \to A$ and $V : S \to \mathbb{R}$; play $\pi_k$, then update the critic $V_k \to V_{k+1}$ and the actor $\pi_k \to \pi_{k+1}$.
State of the art
- Critic-only: Fitted Q Iteration; Q-Learning, Sarsa
- Actor-only: evolutionary algorithms (CMA-ES, ...); PI²
- Actor-critic: Natural Actor-Critic; Cacla
State of the art — unsatisfied constraints
A method annotated with (n) does not satisfy constraint n of the motivations above: (1) continuous environments, (2) no prior models of agent or environment, (3) non-linear approximator, (4) no prior goal states or trajectories.
- Critic-only: Fitted Q Iteration (1); Q-Learning, Sarsa (1)
- Actor-only: evolutionary algorithms (CMA-ES) → poor data efficiency; PI² (3)(4)
- Actor-critic: Natural Actor-Critic (3)(4); Cacla → poor data efficiency, many meta-parameters
Landscape of algorithms
[Figure: algorithms placed on two axes, decisional complexity vs. data required; the ideal algorithm combines high decisional complexity with little data required. NFQ is placed on this landscape.]
Neural Fitted Q (NFQ)
$$Q_{k+1} = \operatorname*{arg\,min}_{Q \in \mathcal{F}_c} \sum_{t=1}^{N} \Big( Q(s_t, a_t) - \big( r_{t+1} + \gamma \max_{a' \in A} Q_k(s_{t+1}, a') \big) \Big)^2$$
$$\pi^*(s) = \operatorname*{arg\,max}_{a \in A} Q(s, a)$$
[Figure: feed-forward network with inputs $s_1, s_2, s_3, a_1, a_2$, one hidden layer, and output $Q(s,a)$.]
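A minimal sketch of one such fitted iteration, assuming a batch of (s, a, r, s') transitions with array-valued states and actions, a finite action set, and scikit-learn's MLPRegressor as a stand-in for the Rprop-trained network of the original NFQ (this is an illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq_iteration(Q, transitions, actions, gamma=0.99):
    """One fitted Q iteration: refit Q on the whole batch of transitions.

    Q is an already-fitted regressor over concatenated (state, action) inputs;
    actions is the finite set of candidate actions (1D numpy arrays).
    """
    X, y = [], []
    for s, a, r, s_next in transitions:
        # Bellman target: r + gamma * max_a' Q_k(s', a')
        q_next = max(Q.predict([np.concatenate([s_next, a_p])])[0]
                     for a_p in actions)
        X.append(np.concatenate([s, a]))
        y.append(r + gamma * q_next)
    # Supervised regression on the fixed targets gives Q_{k+1}
    Q_next = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000)
    Q_next.fit(np.array(X), np.array(y))
    return Q_next
```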
CACLA
Temporal-difference error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
Critic: $V_{k+1}(s_t) = V_k(s_t) + \alpha_v \delta_t$, i.e. on the parameters
$$\theta^V_{i,k+1} = \theta^V_{i,k} + \alpha_v \, \delta_t \, \frac{\partial V_k(s_t)}{\partial \theta^V_{i,k}}$$
Actor (with $u_t$ the deterministic actor output and $a_t$ the exploratory action actually taken):
$$\theta_{t+1} = \begin{cases} \theta_t + \alpha_a (a_t - u_t) \dfrac{\partial u_t(s_t)}{\partial \theta_t}, & \text{if } \delta_t > 0 \\ \theta_t, & \text{otherwise} \end{cases}$$
[Figure: CACLA and CMA-ES placed on the decisional complexity vs. data required landscape.]
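A hedged sketch of one CACLA step, using linear function approximation over a feature map so that the gradients stay explicit (the slides use neural networks; the feature map phi, the learning rates, and the scalar action are assumptions of this illustration):

```python
import numpy as np

def cacla_step(theta_v, theta_a, phi, s, a, r, s_next,
               gamma=0.99, alpha_v=0.05, alpha_a=0.05):
    """One CACLA update with linear critic/actor and a scalar action.

    phi(s) returns a 1D feature vector; u = theta_a @ phi(s) is the
    deterministic actor output and a is the exploratory action taken.
    """
    # Temporal-difference error: delta = r + gamma*V(s') - V(s)
    delta = r + gamma * theta_v @ phi(s_next) - theta_v @ phi(s)
    # Critic: gradient step toward the TD target
    theta_v = theta_v + alpha_v * delta * phi(s)
    # Actor: move toward the exploratory action only when delta > 0
    if delta > 0:
        u = theta_a @ phi(s)
        theta_a = theta_a + alpha_a * (a - u) * phi(s)
    return theta_v, theta_a
```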
Neural Fitted Actor-Critic (overview)
[Diagram: 1) interactions — the agent plays exploratory actions $a \sim \pi$ in the environment, observes $s, r$, and stores the data $D_\pi$; 2a) actor update — an Rprop regression of $\pi$ toward the pairs $\{s_t, a_t\}$ where $\delta > 0$ and $\{s_t, u_t\}$ where $\delta \leq 0$; 2b) critic update — an Rprop regression of $V_k$ on the pairs $\{s_t, v_{k,t}\}$ giving $V_{k+1}$; then repeat.]
Neural Fitted Actor-Critic (updates)
Critic (regression over the on-policy data $D_\pi$):
$$V_{k+1} \leftarrow \operatorname*{arg\,min}_{V \in \mathcal{F}_c} \sum_{s_t \in D_\pi} \Big( V(s_t) - \big( r_{t+1} + \gamma V_{k-1}(s_{t+1}) \big) \Big)^2$$
[Figure: network with inputs $s_1, s_2, s_3$, one hidden layer, output $V(s)$.]
Actor:
$$\pi_{k+1} \leftarrow \operatorname*{arg\,min}_{\pi \in \mathcal{F}_a} \sum_{s_t \in D_\pi} \left( \pi(s_t) - \begin{cases} a_t, & \text{if } \delta_t > 0 \\ u_t, & \text{otherwise} \end{cases} \right)^2$$
[Figure: network with inputs $s_1, s_2, s_3$, one hidden layer, outputs $a_1, a_2$.]
[NFAC placed on the decisional complexity vs. data required landscape.]
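A minimal sketch of these two batch regressions, with scikit-learn's MLPRegressor standing in for the Rprop-trained networks and a batch whose elements are assumed to be (s, u, a, r, s_next), where u is the actor's own output and a the exploratory action actually executed (an illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfac_update(V, pi, batch, gamma=0.99):
    """One NFAC iteration: refit critic and actor on the on-policy batch D_pi.

    V is an already-fitted value regressor; each batch element is
    (s, u, a, r, s_next) with u = pi(s) and a the exploratory action.
    """
    S = np.array([s for s, _, _, _, _ in batch])
    S_next = np.array([sn for _, _, _, _, sn in batch])
    R = np.array([r for _, _, _, r, _ in batch])
    # Critic targets: r + gamma * V(s') computed from the previous value function
    v_targets = R + gamma * V.predict(S_next)
    delta = v_targets - V.predict(S)  # TD errors over the batch
    # Actor targets: the exploratory action a where delta > 0, else the actor's own output u
    a_targets = np.array([a if d > 0 else u
                          for (_, u, a, _, _), d in zip(batch, delta)])
    V_next = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000).fit(S, v_targets)
    pi_next = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000).fit(S, a_targets)
    return V_next, pi_next
```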
Experimental Results
Landscape of methods
[Figure: the decisional complexity vs. data required landscape with the compared methods placed relative to the ideal algorithm: NFQ, NFAC+, NFAC, CACLA, CMA-ES, and (with a question mark) DDPG and NAF.]
Toward Better Data Efficiency
Fitted Actor-Critic:
$$Q^\pi_{k+1} = \operatorname*{arg\,min}_{Q \in \mathcal{F}_c} \sum_{t=1}^{N} c(a_t \mid s_t) \Big( Q(s_t, a_t) - \big( r_{t+1} + \gamma Q^\pi_k(s_{t+1}, \pi(s_{t+1})) \big) \Big)^2$$
$$\pi_{k+1} = \operatorname*{arg\,max}_{\pi \in \mathcal{F}_a} \sum_{t=1}^{N} Q_{k+1}\big(s_t, \pi(s_t)\big)$$
$$c(a_t \mid s_t) = \min\!\left(1, \frac{\pi(a_t \mid s_t)}{\pi_0(a_t \mid s_t)}\right)$$
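A small sketch of the clipped importance weight and of how it would reweight the critic regression, assuming callables pi_prob and pi0_prob that return the densities of the current and behaviour policies (these names are illustrative; the slides only give the formulas):

```python
# Clipped importance weight c(a|s) = min(1, pi(a|s) / pi_0(a|s)).
def clipped_weight(pi_prob, pi0_prob, s, a):
    return min(1.0, pi_prob(a, s) / pi0_prob(a, s))

# Reweighted critic objective over a batch of (s, a, r, s_next) transitions:
# each squared Bellman error is scaled by c(a|s) before minimisation.
def weighted_critic_loss(Q, Q_prev, pi, transitions, pi_prob, pi0_prob, gamma=0.99):
    loss = 0.0
    for s, a, r, s_next in transitions:
        c = clipped_weight(pi_prob, pi0_prob, s, a)
        target = r + gamma * Q_prev(s_next, pi(s_next))
        loss += c * (Q(s, a) - target) ** 2
    return loss
```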
Conclusion & Further Works
- Neural Fitted Actor-Critic
- Compare to DDPG
- Don't forget previous data
- Guided exploration of the sensorimotor space
- Increase the dimension of states/actions
- Redefine the reward function for the new sub-goal