Deep Neural Networks and Deep Reinforcement Learning (PowerPoint PPT Presentation)


  1. Deep Neural Networks and Deep Reinforcement Learning
     References: Deep Learning, Goodfellow, Bengio and Courville [chapt. 6, 7, 8]; AIMA [sect. 21.1-21.3]; Sutton and Barto, Reinforcement Learning: An Introduction, 2nd Edition [sect. 5.1-5.3, 6.1-6.3, 6.5]

  2. Outline
     ♦ Neural Networks: intro
     ♦ Deep Neural Networks
     ♦ Deep Reinforcement Learning
     ♦ Deep Q Network
     ♦ Slides based on a course offered by Prof. Pascal Poupart at Univ. of Waterloo.

  3. Reinforcement Learning: key points
     ♦ MDPs and value iteration (planning):
       $V(s) = \max_a \sum_{s'} T(s,a,s')\,(R(s,a,s') + \gamma V_k(s'))$
     ♦ TD Learning and Q-Learning (reinforcement learning):
       $Q(s,a) = (1-\alpha)\,Q(s,a) + \alpha\,(R(s,a,s') + \gamma \max_{a'} Q(s',a'))$
     ♦ Key issue: the number of states and actions may be too large to maintain and update a Q-table
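The value-iteration backup above maps directly onto array operations. A minimal NumPy sketch, assuming a toy MDP whose sizes, transition tensor and reward tensor are illustrative placeholders (not taken from the slides):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (sizes and dynamics are illustrative only).
n_states, n_actions, gamma = 3, 2, 0.9
T = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = np.random.randn(n_states, n_actions, n_states)                      # R[s, a, s']
V = np.zeros(n_states)

# One value-iteration backup:
# V(s) = max_a sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V_k(s'))
V = np.max(np.sum(T * (R + gamma * V[None, None, :]), axis=2), axis=1)
```

Repeating this backup until the values stop changing yields the value function used for planning; the Q-learning update on the same slide is shown in code further below (slide 24).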

  4. Reinforcement Learning: Example of large state spaces
     ♦ the game of Go: $3^{361}$ states

  5. Reinforcement Learning: Example of large state spaces
     ♦ Cart pole control problem: the state $(x, x', \theta, \theta')$ is continuous

  6. Reinforcement Learning: Example of large state spaces
     ♦ Atari games: 210 × 160 × 3 (possible pixels considering the RGB layers)

  7. Key Idea: function approximation
     ♦ Which functions are we interested in?
       Policy $\pi(s) \rightarrow a$
       Value function $V(s) \in \Re$
       Q-function $Q(s,a) \in \Re$

  8. Q-Function approximation
     ♦ State is a set of features: $s = (x_1, x_2, \cdots, x_n)^T$
       CartPole: $s = (x, x', \theta, \theta')^T$
       Atari: the values of the pixels
     ♦ Linear approximation: $Q(s,a) \approx \sum_{i=1}^n w_{ai} x_i$
     ♦ Non-linear (neural network): $Q(s,a) \approx g(x; w)$
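A minimal sketch of the linear approximation, assuming one weight vector per action and CartPole-style features (the feature values below are illustrative):

```python
import numpy as np

# Linear Q-function approximation: Q(s, a) = w_a . x, one weight vector per action.
n_features, n_actions = 4, 2           # e.g. CartPole state (x, x', theta, theta')
W = np.zeros((n_actions, n_features))  # W[a] holds the weights w_a

def q_values(x, W):
    """Return the vector of Q(s, a) for all actions, given feature vector x."""
    return W @ x

x = np.array([0.1, -0.5, 0.02, 0.3])   # illustrative state features
print(q_values(x, W))                   # [0., 0.] before any learning
```

The non-linear case simply replaces the dot product with a neural network $g(x; w)$, as developed in the following slides.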

  9. Feed-Forward ANN
     ♦ Network of units (computational neurons)
     ♦ DAG connecting functions with weighted edges
     ♦ Each unit computes $h(w^T x + b)$
       $w$: weights, $x$: inputs to the node, $b$: bias, $h$: activation function, usually non-linear

  10. One hidden layer ANN
     ♦ hidden units: $z_j = h_1(w^{(1)T}_j x + b^{(1)}_j)$
     ♦ output units: $y_k = h_2(w^{(2)T}_k z + b^{(2)}_k)$
     ♦ overall: $y_k = h_2\big(\sum_j w^{(2)}_{kj}\, h_1\big(\sum_i w^{(1)}_{ji} x_i + b^{(1)}_j\big) + b^{(2)}_k\big)$
       $w$: weights, $x$: inputs to the node, $b$: bias, $h$: activation function, usually non-linear
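A small sketch of this forward pass, assuming tanh hidden units and identity outputs (the layer sizes and random weights are illustrative):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer feed-forward pass: z = h1(W1 x + b1), y = h2(W2 z + b2)."""
    z = np.tanh(W1 @ x + b1)   # hidden units, h1 = tanh
    y = W2 @ z + b2            # output units, h2 = identity
    return y, z

# Illustrative sizes: 4 inputs, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.uniform(-1, 1, (5, 4)), np.zeros(5)
W2, b2 = rng.uniform(-1, 1, (2, 5)), np.zeros(2)
y, _ = forward(np.ones(4), W1, b1, W2, b2)
```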

  11. Activation function h
     ♦ threshold: $h(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{otherwise} \end{cases}$
     ♦ sigmoid: $h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
     ♦ tanh: $h(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
     ♦ gaussian: $h(x) = e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
     ♦ identity: $h(x) = x$
     ♦ rectified linear (ReLU): $h(x) = \max\{0, x\}$
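The same activation functions written out in NumPy, as one possible reference (the gaussian parameters default to assumed values $\mu = 0$, $\sigma = 1$):

```python
import numpy as np

threshold = lambda x: np.where(x >= 0, 1.0, -1.0)
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh      = np.tanh
gaussian  = lambda x, mu=0.0, sigma=1.0: np.exp(-0.5 * ((x - mu) / sigma) ** 2)
identity  = lambda x: x
relu      = lambda x: np.maximum(0.0, x)
```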

  12. Universal approximation property
     Theorem (Hornik et al., 1989; Cybenko, 1989): a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (sigmoid/tanh/gaussian) can approximate any function arbitrarily closely, provided that the network is given enough hidden units.
     ♦ any: continuous function on a closed and bounded subset of $\Re^n$ (relationship with Borel measurability)

  13. Minimize least squared error
     ♦ Key idea to optimize the weights: minimize the error with respect to the output (loss)
       $E(W) = \sum_n E_n(W) = \sum_n \frac{|f(x_n; W) - y_n|^2}{2}$
     ♦ Non-convex optimization problem: can train by using gradient descent
       Given sample $(x_n, y_n)$, update the weights as follows: $w_{ji} \leftarrow w_{ji} - \eta \frac{\partial E_n}{\partial w_{ji}}$
       Backpropagation algorithm to compute the gradient in an ANN
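A minimal sketch of one such gradient-descent step, assuming the one-hidden-layer network of slide 10 with tanh hidden units and identity outputs (the backpropagation equations below follow from those assumptions):

```python
import numpy as np

def sgd_step(x, y_target, W1, b1, W2, b2, eta=0.01):
    """One backpropagation / gradient-descent step on E_n = 0.5 * ||y - y_target||^2."""
    # forward pass
    a1 = W1 @ x + b1
    z = np.tanh(a1)
    y = W2 @ z + b2                              # identity output layer
    # backward pass
    delta2 = y - y_target                        # dE/dy for squared error, identity output
    delta1 = (W2.T @ delta2) * (1.0 - z ** 2)    # tanh'(a1) = 1 - tanh(a1)^2
    # gradient-descent update: w <- w - eta * dE/dw
    W2 -= eta * np.outer(delta2, z); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x); b1 -= eta * delta1
    return W1, b1, W2, b2
```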

  14. Deep Neural Networks
     ♦ Deep NN: ANN with many hidden layers
     ♦ Benefit: high expressivity (i.e., compact representation)
     ♦ Issues: can we train a Deep NN in the same way? can we avoid overfitting?

  15. Example: Image Classification
     ♦ ImageNet Large Scale Visual Recognition Challenge

  16. Vanishing Gradient
     ♦ Deep neural networks that use "squashing" functions (e.g., sigmoid, tanh) suffer from vanishing gradients

  17. Sigmoid and Hyperbolic functions
     ♦ The derivative of the sigmoid and of tanh is always less than one!
     ♦ when backpropagating gradients we multiply several numbers that are less than one

  18. Example: vanishing gradient
     ♦ $y = t(w_3\, t(w_2\, t(w_1 x)))$, where $t(\cdot)$ is the tanh function
     ♦ common weight initialization in $(-1, 1)$
     ♦ the tanh function and its derivative are less than 1
     ♦ vanishing gradient (with $a_1 = w_1 x$, $a_2 = w_2 t(a_1)$, $a_3 = w_3 t(a_2)$):
       $\frac{\partial y}{\partial w_3} = t'(a_3)\, t(a_2)$
       $\frac{\partial y}{\partial w_2} = t'(a_3)\, w_3\, t'(a_2)\, t(a_1) \le \frac{\partial y}{\partial w_3}$
       $\frac{\partial y}{\partial w_1} = t'(a_3)\, w_3\, t'(a_2)\, w_2\, t'(a_1)\, x \le \frac{\partial y}{\partial w_2}$
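A quick numeric check of these three gradients; the weight and input values below are illustrative, chosen inside the $(-1, 1)$ initialization range mentioned above:

```python
import numpy as np

# y = t(w3 * t(w2 * t(w1 * x))) with t = tanh; weights in (-1, 1) as in the slide.
w1, w2, w3, x = 0.5, 0.5, 0.5, 1.0
t, dt = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2   # tanh and its derivative

a1 = w1 * x
a2 = w2 * t(a1)
a3 = w3 * t(a2)

dy_dw3 = dt(a3) * t(a2)
dy_dw2 = dt(a3) * w3 * dt(a2) * t(a1)
dy_dw1 = dt(a3) * w3 * dt(a2) * w2 * dt(a1) * x
print(dy_dw3, dy_dw2, dy_dw1)   # each extra factor is < 1, so the gradient shrinks layer by layer
```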

  19. Mitigations for vanishing gradient
     ♦ typical solutions to mitigate vanishing gradients:
       Pre-training
       Rectified Linear Units
       Batch normalization
       Skip connections

  20. Rectified Linear Units (ReLU)
     ♦ Rectified linear: $h(x) = \max(0, x)$
       Gradient is 0 or 1
       Piecewise linear
       Sparse computation
     ♦ Soft version (Softplus): $h(x) = \log(1 + e^x)$
     ♦ Softplus does not mitigate gradient vanishing: its derivative is the sigmoid, which is always less than 1 (see the sketch below)
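A small NumPy sketch of the two functions and their derivatives, for illustration only:

```python
import numpy as np

relu          = lambda x: np.maximum(0.0, x)
relu_grad     = lambda x: np.where(x > 0, 1.0, 0.0)   # gradient is 0 or 1
softplus      = lambda x: np.log1p(np.exp(x))         # smooth version of ReLU
softplus_grad = lambda x: 1.0 / (1.0 + np.exp(-x))    # = sigmoid(x), always < 1
```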

  21. Deep Reinforcement Learning: key points
     ♦ For many real-world domains we cannot explicitly represent the key functions for RL ($\pi(s)$, $V(s)$, $Q(s,a)$)
     ♦ We can try to approximate them:
       Linear approximation
       Neural network approximation (Deep RL)
     ♦ Deep Q Network approximates $Q(s,a)$ with a DNN

  22. Gradient Q-Learning
     ♦ approximate $Q(s,a)$ with a parametrized function $Q_w(s,a)$
     ♦ Minimize the squared error between estimate and target
       Estimate: $Q_w(s,a)$
       Target: $r(s,a,s') + \gamma \max_{a'} Q_w(s',a')$
     ♦ squared error: $Err(w) = (Q_w(s,a) - r(s,a,s') - \gamma \max_{a'} Q_w(s',a'))^2$
     ♦ gradient: $\frac{\partial Err(w)}{\partial w} = 2\,(Q_w(s,a) - r(s,a,s') - \gamma \max_{a'} Q_w(s',a'))\, \frac{\partial Q_w(s,a)}{\partial w}$
       (the scalar 2 is a constant factor and not important for the update)

  23. Gradient Q-Learning Algorithm
     Algorithm 1: Gradient Q-Learning
       1: Initialize weights $w$ randomly in $[-1, 1]$
       2: Initialize $s$ {observe current state}
       3: loop
       4:   Select and execute action $a$
       5:   Observe new state $s'$, receive immediate reward $r$
       6:   $\frac{\partial Err(w)}{\partial w} = (Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a'))\, \frac{\partial Q_w(s,a)}{\partial w}$
       7:   update weights $w \leftarrow w - \alpha \frac{\partial Err(w)}{\partial w}$
       8:   update state $s \leftarrow s'$
       9: end loop
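A runnable sketch of this loop, using the linear approximation from slide 25 for concreteness (so $\partial Q_w / \partial w_a$ reduces to the feature vector); the toy_env_step "environment" and the epsilon-greedy action choice are illustrative placeholders, not part of the slides:

```python
import numpy as np

# Gradient Q-learning sketch with a linear approximator Q_w(s, a) = w[a] . s
n_features, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.01, 0.1
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, (n_actions, n_features))   # 1: initialize weights randomly in [-1, 1]

def toy_env_step(s, a):
    """Placeholder transition: random next state and reward (stand-in for a real environment)."""
    return rng.standard_normal(n_features), rng.standard_normal()

s = rng.standard_normal(n_features)               # 2: observe current state
for _ in range(1000):                             # 3: loop
    # 4: select (epsilon-greedy) and execute action a
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(w @ s))
    s_next, r = toy_env_step(s, a)                # 5: observe s', receive reward r
    # 6: gradient of the squared TD error; for the linear case dQ_w/dw[a] = s
    td_error = w[a] @ s - (r + gamma * np.max(w @ s_next))
    w[a] -= alpha * td_error * s                  # 7: update weights w <- w - alpha * dErr/dw
    s = s_next                                    # 8: update state
```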

  24. Convergence of tabular Q-Learning
     ♦ $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a))$
     ♦ Tabular Q-Learning converges to the optimal policy:
       if you explore enough
       if you make the learning rate small enough ... but do not decrease it too quickly
       e.g., $\alpha = 1/n(s,a)$, where $n(s,a)$ is the number of visits to $(s,a)$
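A minimal sketch of this tabular update with the count-based learning rate $\alpha = 1/n(s,a)$; the table sizes are illustrative and the transitions are assumed to come from some environment not shown here:

```python
import numpy as np

# Tabular Q-learning update with a per-pair learning rate alpha = 1 / n(s, a).
n_states, n_actions, gamma = 5, 2, 0.9
Q = np.zeros((n_states, n_actions))
n = np.zeros((n_states, n_actions))   # visit counts n(s, a)

def q_update(s, a, r, s_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    n[s, a] += 1
    alpha = 1.0 / n[s, a]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```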

  25. Convergence of linear gradient Q-Learning
     ♦ linear approximation of $Q(s,a)$: $Q(s,a) \approx \sum_i w_{ai} x_i = w^T x$
     ♦ $\alpha_t = 1/t$
     ♦ gradient Q-learning update:
       $w \leftarrow w - \alpha_t\,(Q_w(s,a) - r - \gamma \max_{a'} Q_w(s',a'))\, \frac{\partial Q_w(s,a)}{\partial w}$
     ♦ under these conditions, linear gradient Q-learning converges

  26. Non-Convergence of non-linear gradient Q-Learning
     ♦ Non-linear approximation of $Q(s,a)$: $Q(s,a) \approx g(x; w)$
     ♦ Even if $\alpha_t = 1/t$, gradient Q-Learning may not converge
     ♦ Issue: we update the weights to reduce the error for a specific experience (i.e., a specific $(s,a)$), but by changing the weights we may end up changing $Q(s,a)$ potentially everywhere.
       This is also true for linear approximation, but in that case convergence can still be guaranteed

  27. Mitigating divergence
     ♦ Two main approaches to mitigate divergence:
       1. experience replay
       2. use two different networks: a Q-network and a target network

  28. Experience replay
     ♦ Store previous experiences (i.e., $(s, a, s', r)$ tuples) and use them at each step:
       Store previous $(s, a, s', r)$ tuples in a dedicated memory buffer
       At each step sample a mini-batch from this buffer and use it to update the weights (see the sketch below)
     ♦ Benefits:
       1. reduces correlation between successive samples (increases stability)
       2. reduces the number of interactions with the environment (increases data efficiency)
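A minimal replay-buffer sketch; the class name, capacity and batch size are illustrative choices, not prescribed by the slides:

```python
import random
from collections import deque

# Minimal experience-replay buffer: store (s, a, s', r) tuples, sample mini-batches.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped when full

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size=32):
        """Uniformly sample a mini-batch used to update the Q-network weights."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

Sampling uniformly from the buffer breaks the temporal correlation between consecutive transitions, which is the stability benefit listed above; the second network mentioned on slide 27 (the target network, updated only periodically) would supply the targets when these mini-batches are used for training.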
