Reinforcement Learning
Reinforcement Learning • Now that you know a little about Optimal Control Theory, you already have some knowledge of RL. • RL shares its overall goal with OCT: solving for a control policy such that the cumulative cost is minimized; both are well suited to problems that involve a long-term versus short-term reward trade-off. • But OCT assumes perfect knowledge of the system’s description in the form of a model and ensures strong guarantees, while RL operates directly on measured data and rewards from interaction with the environment.
RL in Robotics • Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. • The designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot. • Problems are often high-dimensional with continuous states and actions, and the state is often partially observable. • Experience on a real physical system is tedious to obtain, expensive and often hard to reproduce.
Problem Definition • A reinforcement learning problem typically includes: • A set of states: S • A set of actions: A • Transition probabilities: P^a_{ss'} : S × A × S → ℝ, with 0 ≤ P^a_{ss'} ≤ 1 and Σ_{s'} P^a_{ss'} = 1 • Reward function: r : S → ℝ • Here we assume full observability but with a stochastic transition model.
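As a concrete illustration of this formalism, here is a minimal sketch of how such a finite MDP could be represented in Python; the states, actions, transition probabilities, and rewards below are made-up example values, not from the slides.

```python
# A minimal sketch of a finite MDP as defined above (hypothetical example values).
# transitions[s][a] maps each next state s' to P^a_{ss'}; probabilities sum to 1.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]

transitions = {
    "s0": {"left":  {"s0": 0.9, "s1": 0.1},
           "right": {"s1": 0.8, "s2": 0.2}},
    "s1": {"left":  {"s0": 1.0},
           "right": {"s2": 1.0}},
    "s2": {"left":  {"s2": 1.0},
           "right": {"s2": 1.0}},
}

rewards = {"s0": 0.0, "s1": 0.0, "s2": 1.0}   # r : S -> R

# Sanity check: each transition distribution sums to 1.
for s in states:
    for a in actions:
        assert abs(sum(transitions[s][a].values()) - 1.0) < 1e-9
```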
Long-term Expected Return • Finite-horizon expected return: J = E[ Σ_{k=0}^{H} r_k ] • Infinite-horizon return with a discount factor γ: J = E[ Σ_{k=0}^{∞} γ^k r_k ] • In the limit when γ approaches 1, the metric approaches what is known as the average-reward criterion: J = lim_{H→∞} E[ (1/H) Σ_{k=0}^{H} r_k ]
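A small sketch of how these returns are computed for one sampled trajectory; the reward values and γ below are arbitrary example numbers, not from the slides.

```python
# Sketch: discounted return for a sampled reward sequence (values are made up).
gamma = 0.9
rewards_along_episode = [0.0, 0.0, 1.0, 0.5]

# J ~ sum_k gamma^k r_k for this one sampled trajectory.
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards_along_episode))

# The finite-horizon return is the same sum without discounting (gamma = 1).
undiscounted_return = sum(rewards_along_episode)
```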
Value Function • Recall from optimal control theory, v(x) = “minimal total cost for completing the task starting from state x” • The value function for a particular policy Π, V^Π(s) : S → ℝ: V^Π(s) = E_Π{ R_t | s_t = s } = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }, where R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· and γ is the discount factor • The optimal value function: V*(s) = max_Π V^Π(s)
Policy • Deterministic policy: Π : S → A • Probabilistic policy: Π : S × A → ℝ • The optimal policy: Π* = arg max_Π V^Π(s)
Exploration and Exploitation • To gain information about the rewards and the behavior of the system, the agent needs to explore by trying previously unused actions or actions it is uncertain about. • It must decide whether to stick to well-known actions with high rewards or to try new things in order to discover new strategies with an even higher reward. • This problem is commonly known as the exploration-exploitation trade-off.
Approaches • Value Function Approach: Dynamic Programming (Value Iteration, Policy Iteration), Monte Carlo, Temporal Difference (TD(λ), SARSA, Q-learning) • Policy Search Approach: Policy Gradient, Expectation–Maximization, Information-Theoretic, Path Integral • Actor-Critic Approach
Bellman Equation • The expected long-term reward of a policy can be expressed in a recursive formulation: V^Π(s) = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s } = Σ_{s'} P^{Π(s)}_{ss'} ( r(s') + γ E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_{t+1} = s' } ) = Σ_{s'} P^{Π(s)}_{ss'} ( r(s') + γ V^Π(s') )
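A minimal sketch of one application of this recursion as a backup operation, assuming the hypothetical tabular `transitions` and `rewards` dictionaries from the earlier MDP sketch:

```python
# Sketch: one Bellman backup  V(s) <- sum_{s'} P^{Pi(s)}_{ss'} ( r(s') + gamma V(s') )
# for a deterministic policy, using the hypothetical tabular MDP defined earlier.
def bellman_backup(V, policy, s, transitions, rewards, gamma=0.9):
    a = policy[s]                           # deterministic policy Pi(s)
    return sum(p * (rewards[s2] + gamma * V[s2])
               for s2, p in transitions[s][a].items())
```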
Value Iteration • Value iteration starts with a guess V^(0) of the optimal value function and constructs a sequence of improved guesses: V^(i+1)(s) = max_Π Σ_{s'} P^{Π(s)}_{ss'} ( r(s') + γ V^(i)(s') ) • This process is guaranteed to converge to the optimal value function V*.
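A sketch of value iteration over a tabular MDP under the same hypothetical representation as above; note that the maximization over policies on the slide reduces here to a maximization over actions, which is equivalent for this one-step lookahead.

```python
# Sketch of value iteration over a tabular MDP (hypothetical structures from above).
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                     # V^(0): initial guess
    while True:
        V_new = {}
        for s in states:
            # max over actions of sum_{s'} P^a_{ss'} ( r(s') + gamma V(s') )
            V_new[s] = max(
                sum(p * (rewards[s2] + gamma * V[s2])
                    for s2, p in transitions[s][a].items())
                for a in actions
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```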
Policy Iteration • Find the optimal policy by iterating two procedures until convergence • Policy Evaluation • Policy Improvement
Policy Evaluation
Input: Π    Output: V^Π
Step 1: Arbitrarily initialize V(s), ∀s ∈ S
Step 2: Repeat
    For each s:
        a = Π(s)
        V(s) = Σ_{s'} P^a_{ss'} ( r(s') + γ V(s') )
    Until convergence
Step 3: Output V(s)
Policy Improvement
Input: V^Π    Output: Π'
Step 1: For each s:
    Π'(s) = arg max_a Σ_{s'} P^a_{ss'} ( r(s') + γ V^Π(s') )
Step 2: Output Π'(s)
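Putting the two procedures together, a sketch of policy iteration over the same hypothetical tabular MDP; the function and variable names are illustrative, not from the slides.

```python
# Sketch of policy iteration: alternate evaluation and improvement until stable.
def policy_evaluation(policy, states, transitions, rewards, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (rewards[s2] + gamma * V[s2])
                        for s2, p in transitions[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_improvement(V, states, actions, transitions, rewards, gamma=0.9):
    return {s: max(actions,
                   key=lambda a: sum(p * (rewards[s2] + gamma * V[s2])
                                     for s2, p in transitions[s][a].items()))
            for s in states}

def policy_iteration(states, actions, transitions, rewards, gamma=0.9):
    policy = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        V = policy_evaluation(policy, states, transitions, rewards, gamma)
        new_policy = policy_improvement(V, states, actions, transitions, rewards, gamma)
        if new_policy == policy:                      # policy is stable: done
            return policy, V
        policy = new_policy
```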
Monte Carlo Approach • Both value iteration and policy iteration use a dynamic programming approach. • The dynamic programming approach requires a transition model P, which is often unavailable in real-world problems. • The Monte Carlo algorithm does not require a model to be known. Instead, it generates samples to approximate the value function.
The Q function • Introduce the Q function Q^Π(s, a) : S × A → ℝ: Q^Π(s, a) = E_Π{ R_t | s_t = s, a_t = a } = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a } • Use the Q(s, a) function instead of the value function V(s) because, in the absence of a transition model, the values with respect to all possible actions at s must be stored explicitly. • The optimal Q function: Q*(s, a) = max_Π Q^Π(s, a), with V*(s) = max_a Q*(s, a) and Π*(s) = arg max_a Q*(s, a)
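A tiny sketch of how V*(s) and Π*(s) are read off a tabular Q function, assuming the hypothetical representation of Q as a dict of dicts:

```python
# Sketch: deriving V*(s) and Pi*(s) from a tabular Q function (dict of dicts).
def greedy_policy_and_value(Q, s):
    best_a = max(Q[s], key=Q[s].get)    # Pi*(s) = argmax_a Q*(s, a)
    return best_a, Q[s][best_a]         # V*(s)  = max_a    Q*(s, a)
```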
Monte Carlo Policy Iteration
Step 1: Arbitrarily initialize Q(s, a), set Π(s, a) = 1/|A|, and create an empty list return(s, a), ∀s ∈ S, ∀a ∈ A
Step 2: Repeat many times
    Generate an episode using Π: s_0 →(a_0) s_1 →(a_1) s_2 →(a_2) ···
    Policy evaluation: for each pair (s, a) in the episode
        Compute the long-term return R from (s, a)
        Append R to return(s, a)
        Assign the average of return(s, a) to Q(s, a)
    Policy improvement: continued on the next slide...
Monte Carlo Policy Iteration (continued)
    Policy improvement: for all s
        a* = arg max_a Q(s, a)
        Π(s, a) = 1 − ε          if a = a*
        Π(s, a) = ε / (|A| − 1)  if a ≠ a*
Step 3: Output Π(s) = arg max_a Q(s, a)
ε is a small number, which controls the trade-off between exploration and exploitation.
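A sketch of this Monte Carlo policy iteration with an ε-greedy policy, under some assumptions: `sample_episode` is a hypothetical helper that rolls out one episode under the current policy and returns a list of (state, action, reward) triples, and the every-visit variant of return averaging is used.

```python
import random
from collections import defaultdict

# Sketch of Monte Carlo policy iteration with an epsilon-greedy policy.
# `sample_episode(policy)` is a hypothetical helper returning [(s, a, r), ...].
def mc_control(sample_episode, actions, gamma=0.9, epsilon=0.1, n_episodes=1000):
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    returns = defaultdict(list)                       # the list return(s, a)

    def policy(s):                                    # epsilon-greedy w.r.t. Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(Q[s], key=Q[s].get)

    for _ in range(n_episodes):
        episode = sample_episode(policy)
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return from (s, a).
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[(s, a)].append(G)
            Q[s][a] = sum(returns[(s, a)]) / len(returns[(s, a)])  # policy evaluation

    return {s: max(Q[s], key=Q[s].get) for s in Q}    # greedy output policy
```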
Temporal Difference Learning • A problem with Monte Carlo learning is that it takes a lot of time to simulate/execute the episodes. • Temporal Difference (TD) learning is a combination of Monte Carlo and dynamic programming. • It updates the value function based on previously learned estimates.
Policy Iteration in TD
Step 1: Arbitrarily initialize Q(s, a) and set Π(s, a) = 1/|A|
Step 2: Repeat for each episode
    s = initial state of the episode
    a = sample an action from Π(s, ·)
    Repeat for each step of the episode
        s' = new state reached by taking action a from s
        Continued on the next slide...
Policy Iteration in TD (continued)
        a* = arg max_a Q(s', a)
        Π(s', a) = 1 − ε          if a = a*
        Π(s', a) = ε / (|A| − 1)  if a ≠ a*
        a' = sample an action from Π(s', ·)
        Q(s, a) = Q(s, a) + α [ r(s') + γ Q(s', a') − Q(s, a) ]
        s = s'
        a = a'
    until s is the terminal state
Step 3: Output Π(s) = arg max_a Q(s, a)
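A sketch of this on-policy TD procedure (SARSA); `env_step` and `initial_state` are hypothetical environment helpers, with `env_step(s, a)` assumed to return the next state, its reward r(s'), and a termination flag.

```python
import random
from collections import defaultdict

# Sketch of the on-policy TD (SARSA) update from the slides.
# `env_step(s, a) -> (s', r(s'), done)` and `initial_state()` are hypothetical helpers.
def sarsa(env_step, initial_state, actions, gamma=0.9, alpha=0.1,
          epsilon=0.1, n_episodes=1000):
    Q = defaultdict(lambda: {a: 0.0 for a in actions})

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(Q[s], key=Q[s].get)

    for _ in range(n_episodes):
        s = initial_state()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = epsilon_greedy(s_next)
            # Q(s,a) <- Q(s,a) + alpha [ r(s') + gamma Q(s',a') - Q(s,a) ]
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next

    return {s: max(Q[s], key=Q[s].get) for s in Q}
```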
n-step TD and Linear Combination • [Backup diagrams: 1-step TD, 2-step TD, ..., Monte Carlo, each following the trajectory s → a → s' → a' → s'' → ···] • The n-step returns are combined linearly with weights (1 − λ)λ^{n−1}. • λ = 0 recovers the 1-step TD method; λ = 1 recovers the Monte Carlo method.
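Spelling out the weighting shown in the backup diagrams, the λ-return is the standard combination below; this formula is a common reconstruction of what the diagram depicts, not written explicitly on the slide.

```latex
% lambda-return: a linear combination of n-step returns with weights (1 - lambda) lambda^{n-1}
R_t^{\lambda} \;=\; (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)},
\qquad
R_t^{(n)} \;=\; \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} \;+\; \gamma^{n}\, V(s_{t+n})
```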