

  1. Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2019

  2. Machine Learning Subfield of AI concerned with learning from data. Broadly, using: • Experience • To Improve Performance • On Some Task (Tom Mitchell, 1997)

  3. ML vs Statistics vs Data Mining

  4. Why? Developing effective learning methods has proved difficult. Why bother? Autonomous discovery • We don’t know something, want to find out. Hard to program • Easier to specify task, collect data. Adaptive behavior • Our agents should adapt to new data, unforeseen circumstances.

  5. Types of Machine Learning Depends on the feedback available: • Labeled data: supervised learning • No feedback, just data: unsupervised learning • Sequential data, weak labels: reinforcement learning

  6. Supervised Learning Input: training data X = {x_1, …, x_n}, labels Y = {y_1, …, y_n}. Learn to predict new labels: given x, what is y?

  7. Unsupervised Learning Input: inputs X = {x_1, …, x_n}. Try to understand the structure of the data. E.g., how many types of cars are there? How can they vary?

  8. Reinforcement Learning Learning counterpart of planning. Find the policy π: S → A that maximizes the return R = Σ_{t=0}^∞ γ^t r_t.

  9. MDPs Agent interacts with an environment. At each time t: • receives sensor signal s_t • executes action a_t • transition: new sensor signal s_{t+1} and reward r_t Goal: find the policy π that maximizes the expected return (sum of discounted future rewards): max_π E[ Σ_{t=0}^∞ γ^t r_t ]

  10. Markov Decision Processes An MDP is a tuple <S, A, γ, R, T>: • S: set of states • A: set of actions • γ: discount factor • R: reward function; R(s, a, s′) is the reward received for taking action a in state s and transitioning to state s′ • T: transition function; T(s′ | s, a) is the probability of transitioning to state s′ after taking action a in state s
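
As a concrete illustration (not from the slides), a small finite MDP can be written down directly as plain data. The two-state example below, including its states, actions, rewards, and transition probabilities, is entirely made up; only the signatures R(s, a, s′) and T(s′ | s, a) follow the slide.

import random

states = ["s0", "s1"]
actions = ["left", "right"]
gamma = 0.9

# T[(s, a)][s_next] encodes T(s' | s, a)
T = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s1", "right"): {"s0": 0.2, "s1": 0.8},
}

def R(s, a, s_next):
    # R(s, a, s'): reward for taking a in s and transitioning to s'
    return 1.0 if s_next == "s1" else 0.0

def sample_transition(s, a):
    # Draw s' ~ T(. | s, a) and return (s', r)
    nxt = T[(s, a)]
    s_next = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
    return s_next, R(s, a, s_next)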

  11. RL vs Planning In planning: • Transition function (T) known. • Reward function (R) known. • Computation is “offline”. In reinforcement learning: • One or both of T, R unknown. • Acting in the world is the only source of data. • Transitions are executed, not simulated.

  12. Reinforcement Learning

  13. RL This formulation is general enough to encompass a wide variety of learned control problems.

  14. MDPs As before, our target is a policy π: S → A. A policy maps states to actions. The optimal policy maximizes: max_π E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ], ∀s. This means that we wish to find a policy that maximizes the return from every state.

  15. Planning via Policy Iteration In planning, we used policy iteration to find an optimal policy: 1. Start with a policy π 2. Estimate V^π 3. Improve π: 3a. π(s) = argmax_a E[ r + γ V^π(s′) ], ∀s Repeat. More precisely, we use a value function: V^π(s) = E[ Σ_{i=0}^∞ γ^i r_i ] … then we would update π by computing: π(s) = argmax_a Σ_{s′} T(s, a, s′) [ r(s, a, s′) + γ V^π(s′) ]. We can’t do this anymore: the update requires the transition function T, which is unknown in RL.
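
For reference, here is a minimal sketch of the planning-case loop above, reusing the illustrative two-state MDP from the earlier sketch (so T, R, and γ are known). The helper names and iteration counts are arbitrary.

def evaluate_policy(pi, sweeps=100):
    # Iterative policy evaluation: V(s) <- sum_s' T(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            a = pi[s]
            V[s] = sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
    return V

def improve_policy(V):
    # pi(s) = argmax_a sum_s' T(s, a, s') [R(s, a, s') + gamma V(s')]
    def backup(s, a):
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)].items())
    return {s: max(actions, key=lambda a, s=s: backup(s, a)) for s in states}

pi = {s: actions[0] for s in states}   # 1. start with a policy
for _ in range(10):
    V = evaluate_policy(pi)            # 2. estimate V^pi
    pi = improve_policy(V)             # 3. improve pi; repeat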

  16. Value Functions For learning, we use a state-action value function: Q^π(s, a) = E[ Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a ] This is the value of executing a in state s, then following π. Note that V^π(s) = Q^π(s, π(s)).
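
In the tabular case this amounts to storing one entry per (state, action) pair, i.e. |S| × |A| values. A minimal sketch, assuming the dict-based states and actions from the earlier examples:

Q = {(s, a): 0.0 for s in states for a in actions}   # |S| x |A| entries

def V_from_Q(Q, pi, s):
    # V^pi(s) = Q^pi(s, pi(s))
    return Q[(s, pi[s])]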

  17. Policy Iteration This leads to a general policy improvement framework: 1. Start with a policy π 2. Learn Q^π 3. Improve π: 3a. π(s) = argmax_a Q(s, a), ∀s Repeat. Steps 2 and 3 can be interleaved as rapidly as you like. Usually, step 3a is performed every time step.
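
A sketch of step 3a in code, again assuming the tabular Q and the illustrative states and actions from above:

def greedy_policy(Q):
    # Step 3a: pi(s) = argmax_a Q(s, a), for all s
    return {s: max(actions, key=lambda a, s=s: Q[(s, a)]) for s in states}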

  18. Value Function Learning Learning proceeds by gathering samples of Q(s, a). Methods differ by: • How you get the samples. • How you use them to update Q.

  19. Monte Carlo Simplest thing you can do: sample R(s) by running an episode and collecting the rewards r, r, r, … along the way. Do this repeatedly and average the values: Q(s, a) = (R_1(s) + R_2(s) + … + R_n(s)) / n
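
A minimal sketch of this Monte Carlo estimate. The helper run_episode_from(s, a) is hypothetical: it is assumed to execute a in s, follow the current policy until the episode ends, and return the list of rewards observed.

def discounted_return(rewards, gamma=0.9):
    # One sampled return: R = sum_t gamma^t r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mc_estimate(s, a, run_episode_from, n=100, gamma=0.9):
    # Q(s, a) ~ (R_1(s) + R_2(s) + ... + R_n(s)) / n
    returns = [discounted_return(run_episode_from(s, a), gamma) for _ in range(n)]
    return sum(returns) / n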

  20. Temporal Difference Learning Where can we get more (immediate) samples? Idea: use the Bellman equation: Q^π(s, a) = E_{s′}[ r(s, a, s′) + γ Q^π(s′, π(s′)) ] The value of this state is the reward plus the (discounted) value of the next state.

  21. TD Learning Ideally, and in expectation: r_i + γ Q(s_{i+1}, a_{i+1}) − Q(s_i, a_i) = 0 Q is correct if this holds in expectation for all states. When it does not, the difference is the temporal difference error, and for a transition (s_t, a_t, r_t, s_{t+1}) we move the estimate toward the target: Q(s_t, a_t) ← r_t + γ Q(s_{t+1}, a_{t+1})
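
A sketch of a single TD update on a tabular Q. The step size alpha is an assumption at this point (α appears on the Sarsa slide); gamma = 0.9 is illustrative.

def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # TD error: how far the current estimate is from the one-step target
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    # Move Q(s_t, a_t) toward r_t + gamma Q(s_{t+1}, a_{t+1})
    Q[(s, a)] += alpha * delta
    return delta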

  22. Sarsa Sarsa: a very simple algorithm. 1. Initialize Q[s][a] = 0 2. For n episodes: • observe state s • select a = argmax_a Q[s][a] • observe transition (s, a, r, s′, a′) • compute TD error δ = r + γ Q(s′, a′) − Q(s, a), where Q(s′, a′) is zero by definition if s′ is absorbing • update Q: Q(s, a) ← Q(s, a) + αδ • if not at the end of the episode, repeat
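
A runnable sketch of this loop on a tabular Q. The env object is a hypothetical episodic environment with reset() returning a state and step(a) returning (s′, r, done); action selection is purely greedy, as on this slide, with exploration added on slide 26.

def sarsa(env, states, actions, episodes=500, alpha=0.1, gamma=0.9):
    Q = {(s, a): 0.0 for s in states for a in actions}     # 1. initialize Q[s][a] = 0
    for _ in range(episodes):                               # 2. for n episodes
        s = env.reset()                                     # observe state s
        a = max(actions, key=lambda act: Q[(s, act)])       # select a = argmax_a Q[s][a]
        done = False
        while not done:
            s_next, r, done = env.step(a)                   # observe transition (s, a, r, s')
            if done:
                target = 0.0                                # Q(s', a') is zero if s' is absorbing
            else:
                a_next = max(actions, key=lambda act: Q[(s_next, act)])
                target = Q[(s_next, a_next)]
            delta = r + gamma * target - Q[(s, a)]          # TD error
            Q[(s, a)] += alpha * delta                      # update Q
            if not done:
                s, a = s_next, a_next                       # repeat until end of episode
    return Q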

  23. Sarsa

  24. Sarsa

  25. Exploration vs. Exploitation Always take argmax_a Q(s, a)? • Exploit current knowledge. But what if your current knowledge is wrong? How are you going to find out? • Explore to gain new knowledge. Exploration is mandatory if you want to find the optimal solution, but every exploratory action may sacrifice reward. Exploration vs. exploitation: when to try new things? A consistent theme of RL.

  26. Exploration vs. Exploitation How to balance? Simplest, most popular approach: instead of always being greedy (max_a Q(s, a)), explore with probability ε: • max_a Q(s, a) with probability (1 − ε) • random action with probability ε This is ε-greedy exploration (ε ≈ 0.1). • Very simple • Ensures asymptotic coverage of the state space
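
A minimal ε-greedy selection sketch, assuming the tabular Q from the earlier examples:

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore with probability epsilon
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: argmax_a Q(s, a)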

  27. TD vs. MC TD and MC are two extremes of obtaining samples of Q: TD bootstraps with the target r + γV at every step t = 1, 2, 3, …, while MC waits until the end of the episode (t = L) and uses the full return Σ_i γ^i r_i.
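
The same contrast, written as two target computations in a small sketch (V here stands for any current value estimate; all names are illustrative):

def td_target(r, v_next, gamma=0.9):
    # TD: bootstrap after one step using the current value estimate: r + gamma V
    return r + gamma * v_next

def mc_target(rewards, gamma=0.9):
    # MC: wait until t = L and use the full return: sum_i gamma^i r_i
    return sum((gamma ** i) * ri for i, ri in enumerate(rewards))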
