Reinforcement Learning II
George Konidaris
gdk@cs.brown.edu
Fall 2019
Reinforcement Learning

Learn a policy $\pi : S \to A$ that maximizes the discounted return:

$$\max_\pi R = \sum_{t=0}^{\infty} \gamma^t r_t$$
MDPs

Agent interacts with an environment. At each time $t$ it:
• Receives sensor signal $s_t$
• Executes action $a_t$
• Transition:
  • new sensor signal $s_{t+1}$
  • reward $r_t$

Goal: find policy $\pi$ that maximizes expected return (sum of discounted future rewards):

$$\max_\pi R = E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
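A minimal sketch of the return computation above, for a single observed trajectory of rewards (the function name and example rewards are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of discounted rewards: gamma^0*r_0 + gamma^1*r_1 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reward 1, then a final reward of 10.
print(discounted_return([1, 1, 1, 10], gamma=0.9))  # 1 + 0.9 + 0.81 + 7.29 = 10.0
```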
Markov Decision Processes

$\langle S, A, \gamma, R, T \rangle$
• $S$: set of states
• $A$: set of actions
• $\gamma$: discount factor
• $R$: reward function. $R(s, a, s')$ is the reward received for taking action $a$ from state $s$ and transitioning to state $s'$.
• $T$: transition function. $T(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ in state $s$.

RL: one or both of $T$, $R$ unknown.
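One way to picture the tuple $\langle S, A, \gamma, R, T \rangle$ in code, as a rough sketch (the field names and callable signatures are assumptions for illustration):

```python
from typing import Callable, NamedTuple, Set


class MDP(NamedTuple):
    states: Set            # S
    actions: Set           # A
    gamma: float           # discount factor
    R: Callable            # R(s, a, s_next) -> reward
    T: Callable            # T(s_next, s, a) -> transition probability


# In the RL setting, the agent does not get R and/or T; it only sees
# sampled transitions (s, a, r, s') from interacting with the environment.
```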
The World
Real-Valued States

What if the states are real-valued?
• Cannot use a table to represent Q.
• States may never repeat: must generalize.

[Figure: a tabular Q-function vs. a value surface over a continuous state space.]
RL Example

States: real-valued vector $(\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2)$
Actions: +1, -1, 0 units of torque added to the elbow
Transition function: physics!
Reward function: -1 for every step
Value Function Approximation

Represent the Q function as a parametric function $Q(s, a, w)$ with parameter vector $w \in \mathbb{R}^n$.

Samples of form: $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$

Minimize the summed squared TD error:

$$\min_w \sum_{i=0}^{n} \left( r_i + \gamma Q(s_{i+1}, a_{i+1}, w) - Q(s_i, a_i, w) \right)^2$$
Value Function Approximation

Given a function approximator, compute the gradient and descend it.

Which function approximator to use? Simplest thing you can do:
• Linear value function approximation.
• Use a set of basis functions $\phi_1, ..., \phi_n$.
• Q is a linear function of them:

$$\hat{Q}(s, a) = w \cdot \Phi(s, a) = \sum_{j=1}^{n} w_j \phi_j(s, a)$$
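A minimal sketch of a linear Q-function, $\hat{Q}(s, a) = w \cdot \Phi(s, a)$. The feature function and the one-block-per-action stacking here are illustrative assumptions, not the slides' notation:

```python
import numpy as np


def phi(s):
    """State features: [1, x, y] for a 2-D state s = (x, y)."""
    x, y = s
    return np.array([1.0, x, y])


def features(s, a, n_actions=3):
    """Copy the state features into the block for action a (zeros elsewhere)."""
    block = phi(s)
    out = np.zeros(n_actions * block.size)
    out[a * block.size:(a + 1) * block.size] = block
    return out


def q_hat(w, s, a):
    """Linear approximation: Q_hat(s, a) = w . Phi(s, a)."""
    return np.dot(w, features(s, a))


w = np.zeros(9)                     # 3 features x 3 actions
print(q_hat(w, (0.5, -0.2), a=1))   # 0.0 before any learning
```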
Function Approximation

One choice of basis functions:
• Just use the state variables directly: $[1, x, y]$

What can be represented this way?

[Figure: plot of Q as a function of x and y.]
Polynomial Basis

More powerful:
• Polynomials in the state variables.
• 1st order: $[1, x, y, xy]$
• 2nd order: $[1, x, y, xy, x^2, y^2, x^2 y, y^2 x, x^2 y^2]$
• This is like a Taylor expansion.

What can be represented?
Function Approximation

How to get the terms of the Taylor series? Each term has an exponent vector $c$, with each $c_i \in [0, ..., d]$:

$$\phi_c(x, y, z) = x^{c_1} y^{c_2} z^{c_3}$$

Taking all combinations of exponents generates the basis, for example:
• $c = [1, 0, 0]$: $\phi_c(x, y, z) = x^1 y^0 z^0 = x$
• $c = [1, 2, 0]$: $\phi_c(x, y, z) = x^1 y^2 z^0 = xy^2$
• $c = [2, 0, 4]$: $\phi_c(x, y, z) = x^2 y^0 z^4 = x^2 z^4$
• $c = [0, 3, 1]$: $\phi_c(x, y, z) = x^0 y^3 z^1 = y^3 z$
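A minimal sketch of generating these terms by enumerating all exponent vectors $c \in \{0, ..., d\}^n$ (function names are illustrative):

```python
import itertools

import numpy as np


def polynomial_basis(d, n):
    """One basis function phi_c(x) = prod_i x_i**c_i per exponent vector c."""
    return [lambda x, c=np.array(c): np.prod(np.asarray(x, dtype=float) ** c)
            for c in itertools.product(range(d + 1), repeat=n)]


basis = polynomial_basis(d=2, n=3)   # (2+1)^3 = 27 terms over (x, y, z)
print(len(basis))                    # 27
print(basis[0]((2.0, 3.0, 4.0)))     # c = (0, 0, 0): the constant term, 1.0
```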
Function Approximation

Another choice:
• Fourier terms on the state variables.
• $[1, \cos(\pi x), \cos(\pi y), \cos(\pi [x + y])]$
• In general: $\cos(\pi c \cdot [x, y, z])$, where $c$ is a coefficient vector.
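A minimal sketch of the Fourier basis, $\phi_c(x) = \cos(\pi\, c \cdot x)$ for every integer coefficient vector $c \in \{0, ..., d\}^n$ (names and the order parameter are illustrative; states are assumed rescaled to $[0, 1]^n$):

```python
import itertools

import numpy as np


def fourier_basis(state, d):
    """All features cos(pi * c . state) for c in {0, ..., d}^n."""
    state = np.asarray(state, dtype=float)   # assumed rescaled to [0, 1]^n
    coeffs = np.array(list(itertools.product(range(d + 1), repeat=state.size)))
    return np.cos(np.pi * coeffs.dot(state))


print(fourier_basis([0.2, 0.7], d=2))        # (2+1)^2 = 9 features for a 2-D state
```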
Objective Function Minimization

First, let's do stochastic gradient descent. As each data point (transition) comes in:
• compute the gradient of the objective w.r.t. that data point
• descend the gradient a little bit

With $\hat{Q}(s, a) = w \cdot \Phi(s, a)$, the objective becomes:

$$\min_w \sum_{i=0}^{n} \left( r_i + \gamma\, w \cdot \phi(s_{i+1}, a_{i+1}) - w \cdot \phi(s_i, a_i) \right)^2$$
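A minimal sketch of the per-transition update for the linear case, treating the bootstrapped target as fixed when taking the gradient (the usual semi-gradient step); `phi_sa`, the step size, and the discount are assumptions, e.g. one of the feature constructions above:

```python
import numpy as np


def sgd_td_update(w, phi_sa, s, a, r, s_next, a_next, gamma=0.99, alpha=0.01):
    """One stochastic-gradient step on the squared TD error of one transition."""
    td_error = (r
                + gamma * np.dot(w, phi_sa(s_next, a_next))
                - np.dot(w, phi_sa(s, a)))
    return w + alpha * td_error * phi_sa(s, a)   # descend a little bit
```

Applied to each incoming transition $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$, this nudges $w$ toward reducing that transition's squared TD error.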