Monte Carlo Control CMPUT 366: Intelligent Systems S&B §5.3-5.5, 5.7
Lecture Outline
1. Recap
2. Estimating Action Values
3. Monte Carlo Control
4. Importance Sampling
5. Off-Policy Monte Carlo Control
Recap: Monte Carlo vs. Dynamic Programming
• Iterative policy evaluation uses the estimates of the next state's value to update the value of the current state
  • Only needs to compute a single transition to update a state's estimate
• The Monte Carlo estimate of each state's value is independent of the estimates of other states' values
  • Needs the entire episode to compute an update
  • Can focus on evaluating a subset of states if desired
First-visit Monte Carlo Prediction

First-visit MC prediction, for estimating V ≈ v_π
  Input: a policy π to be evaluated
  Initialize:
    V(s) ∈ ℝ, arbitrarily, for all s ∈ S
    Returns(s) ← an empty list, for all s ∈ S
  Loop forever (for each episode):
    Generate an episode following π: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    Loop for each step of episode, t = T−1, T−2, ..., 0:
      G ← γG + R_{t+1}
      Unless S_t appears in S_0, S_1, ..., S_{t−1}:
        Append G to Returns(S_t)
        V(S_t) ← average(Returns(S_t))
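A minimal Python sketch of this first-visit procedure, assuming a hypothetical generate_episode(policy) helper that returns one episode as a list of (state, action, reward) triples:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, gamma=1.0, num_episodes=10_000):
    """First-visit MC prediction: estimate V ≈ v_pi from sampled episodes.

    `generate_episode(policy)` is assumed to return
    [(S_0, A_0, R_1), (S_1, A_1, R_2), ..., (S_{T-1}, A_{T-1}, R_T)].
    """
    returns = defaultdict(list)   # Returns(s): list of observed returns
    V = defaultdict(float)        # V(s): current estimate

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, _, _) in episode]
        G = 0.0
        # Work backwards from the end of the episode
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            # First-visit check: only update if S_t was not visited earlier in the episode
            if s not in states[:t]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```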
Control vs. Prediction
• Prediction: estimate the value of states and/or actions given some fixed policy π
• Control: estimate an optimal policy
Estimating Action Values
• When we know the dynamics p(s′, r | s, a), an estimate of state values is sufficient to determine a good policy:
  • Choose the action that gives the best combination of reward and next-state value
• If we don't know the dynamics, state values are not enough
• To estimate a good policy, we need an explicit estimate of action values
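To make the contrast concrete, here is a hedged sketch: with a known tabular model of p(s′, r | s, a) (assumed here as a dynamics dict), a one-step lookahead from V suffices; without a model, we need Q directly. The data layout and argument names are illustrative assumptions, not from the slides:

```python
def greedy_action_from_V(V, dynamics, state, actions, gamma=1.0):
    """Choose argmax_a of sum over (s', r) of p(s', r | s, a) * [r + gamma * V(s')].

    `dynamics[(s, a)]` is assumed to be a list of (prob, next_state, reward)
    triples, i.e. a tabular model of p(s', r | s, a). Only possible when p is known.
    """
    def q(a):
        return sum(p * (r + gamma * V[s2]) for (p, s2, r) in dynamics[(state, a)])
    return max(actions, key=q)

def greedy_action_from_Q(Q, state, actions):
    """Without a model, we need action values directly: argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])
```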
Exploring Starts
• We can just run first-visit Monte Carlo and average the returns to each state-action pair
• Question: What do we do about state-action pairs that are never visited?
  • If the current policy π never selects an action a from a state s, then Monte Carlo can't estimate its value
• Exploring starts assumption:
  • Every episode starts at a state-action pair S_0, A_0
  • Every pair has a positive probability of being selected as a start
Monte Carlo Control

Monte Carlo control can be used for policy iteration:
  • Evaluation (E): Q → q_π
  • Improvement (I): π → greedy(Q)

π_0 →(E) q_{π_0} →(I) π_1 →(E) q_{π_1} →(I) π_2 →(E) ··· →(I) π_* →(E) q_*
Monte Carlo Control with Exploring Starts

Monte Carlo ES (Exploring Starts), for estimating π ≈ π_*
  Initialize:
    π(s) ∈ A(s) (arbitrarily), for all s ∈ S
    Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
  Loop forever (for each episode):
    Choose S_0 ∈ S, A_0 ∈ A(S_0) randomly such that all pairs have probability > 0
    Generate an episode from S_0, A_0, following π: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    Loop for each step of episode, t = T−1, T−2, ..., 0:
      G ← γG + R_{t+1}
      Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t−1}, A_{t−1}:
        Append G to Returns(S_t, A_t)
        Q(S_t, A_t) ← average(Returns(S_t, A_t))
        π(S_t) ← argmax_a Q(S_t, a)

Question: What unlikely assumptions does this rely upon?
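A sketch of Monte Carlo ES in Python, assuming hypothetical helpers generate_episode_from(s0, a0, policy), a list state_action_pairs of all (s, a) pairs, and actions(s) returning the legal actions in s:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(generate_episode_from, state_action_pairs,
                                actions, gamma=1.0, num_episodes=10_000):
    """Monte Carlo ES sketch: alternate MC evaluation and greedy improvement.

    `generate_episode_from(s0, a0, policy)` is assumed to roll out one episode
    that starts with the pair (s0, a0) and then follows `policy`, returning
    [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)].
    """
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {}  # greedy policy; unseen states fall back to a random action

    def act(s):
        if s in policy:
            return policy[s]
        return random.choice(actions(s))

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has positive probability
        s0, a0 = random.choice(state_action_pairs)
        episode = generate_episode_from(s0, a0, act)
        visited = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:        # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions(s), key=lambda b: Q[(s, b)])  # greedy improvement
    return policy, Q
```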
ε-Soft Policies
• The exploring starts assumption ensures that we see every state-action pair with positive probability
  • Even if π never chooses a from state s
• Another approach: simply force π to (sometimes) choose a!
• An ε-soft policy is one for which π(a|s) ≥ ε/|A(s)| for all s, a
• Example: ε-greedy policy
    π(a|s) = ε/|A(s)|              if a ∉ argmax_{a′} Q(s, a′),
             1 − ε + ε/|A(s)|      otherwise.
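A small sketch of an ε-greedy policy (one way to satisfy the ε-soft condition); Q is assumed to be a dict keyed by (state, action):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Sample an action from the epsilon-greedy policy for Q.

    With probability epsilon, pick uniformly among all actions; otherwise pick
    a greedy action.  Every action then has probability >= epsilon / len(actions).
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy_probability(Q, state, action, actions, epsilon=0.1):
    """pi(a|s) under the epsilon-greedy policy (ties assigned to one argmax)."""
    greedy = max(actions, key=lambda a: Q[(state, a)])
    base = epsilon / len(actions)
    return base + (1.0 - epsilon) if action == greedy else base
```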
Monte Carlo Control w/out Exploring Starts

On-policy first-visit MC control (for ε-soft policies), estimates π ≈ π_*
  Algorithm parameter: small ε > 0
  Initialize:
    π ← an arbitrary ε-soft policy
    Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
  Repeat forever (for each episode):
    Generate an episode following π: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    Loop for each step of episode, t = T−1, T−2, ..., 0:
      G ← γG + R_{t+1}
      Unless the pair S_t, A_t appears in S_0, A_0, S_1, A_1, ..., S_{t−1}, A_{t−1}:
        Append G to Returns(S_t, A_t)
        Q(S_t, A_t) ← average(Returns(S_t, A_t))
        A* ← argmax_a Q(S_t, a)  (with ties broken arbitrarily)
        For all a ∈ A(S_t):
          π(a|S_t) ← 1 − ε + ε/|A(S_t)|   if a = A*
                     ε/|A(S_t)|           if a ≠ A*
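A minimal sketch of the on-policy loop, assuming a hypothetical generate_episode(sample_action) helper; the ε-greedy policy is represented implicitly by Q together with the sampling rule above:

```python
from collections import defaultdict
import random

def on_policy_mc_control(generate_episode, actions, epsilon=0.1,
                         gamma=1.0, num_episodes=10_000):
    """On-policy first-visit MC control for an epsilon-soft (epsilon-greedy) policy.

    `generate_episode(sample_action)` is assumed to return
    [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)], calling
    `sample_action(state)` to choose each action.
    """
    Q = defaultdict(float)
    returns = defaultdict(list)

    def sample_action(s):
        # epsilon-greedy: the policy stays epsilon-soft at every step
        if random.random() < epsilon:
            return random.choice(actions(s))
        return max(actions(s), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode(sample_action)
        visited = [(s, a) for (s, a, _) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in visited[:t]:       # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Improvement is implicit: sample_action is greedy w.r.t. the updated Q
    return Q
```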
Monte Carlo Control w/out Exploring Starts (cont.)
(The same on-policy first-visit MC control algorithm as on the previous slide.)
Question: Will this procedure converge to the optimal policy π*? Why or why not?
Importance Sampling
• Question: What was importance sampling, the last time we studied it (in Supervised Learning)?
• Monte Carlo sampling: use samples from the target distribution to estimate expectations
• Importance sampling: use samples from a proposal distribution g to estimate expectations under the target distribution f by reweighting the samples:
    𝔼_f[X] = Σ_x f(x) x = Σ_x g(x) (f(x)/g(x)) x ≈ (1/n) Σ_{i=1}^n (f(x_i)/g(x_i)) x_i,   where x_i ∼ g
  The factor f(x)/g(x) is the importance sampling ratio.
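A toy sketch of the reweighting idea, using normal distributions as an illustrative target f and proposal g (the particular distributions are assumptions for the example, not from the slides):

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def importance_sampling_mean(n=100_000, seed=0):
    """Estimate E_f[X] for target f = N(1, 1) using samples from proposal g = N(0, 2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                                      # x_i ~ g
        ratio = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)    # f(x_i) / g(x_i)
        total += ratio * x
    return total / n   # should be close to 1.0, the mean of the target

print(importance_sampling_mean())
```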
Off-Policy Prediction via Importance Sampling
Definition: Off-policy learning means using data generated by a behaviour policy (the proposal distribution) to learn about a distinct target policy (the target distribution).
Off-Policy Monte Carlo Prediction
• Generate episodes using a behaviour policy b
• Take a weighted average of the returns to state s over all episodes containing a visit to s, to estimate v_π(s)
• Each return is weighted by the importance sampling ratio of the trajectory from S_t = s until the end of the episode:
    ρ_{t:T−1} ≐ Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ∼ π] / Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ∼ b]
Importance Sampling Ratios for Trajectories
• Probability of a trajectory A_t, S_{t+1}, A_{t+1}, ..., S_T from S_t:
    Pr[A_t, S_{t+1}, ..., S_T | S_t, A_{t:T−1} ∼ π]
      = π(A_t|S_t) p(S_{t+1}|S_t, A_t) π(A_{t+1}|S_{t+1}) ··· p(S_T|S_{T−1}, A_{T−1})
• Importance sampling ratio for a trajectory A_t, S_{t+1}, A_{t+1}, ..., S_T from S_t:
    ρ_{t:T−1} ≐ [∏_{k=t}^{T−1} π(A_k|S_k) p(S_{k+1}|S_k, A_k)] / [∏_{k=t}^{T−1} b(A_k|S_k) p(S_{k+1}|S_k, A_k)]
              = ∏_{k=t}^{T−1} π(A_k|S_k) / b(A_k|S_k)
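Because the transition probabilities cancel, the ratio needs only the two policies' action probabilities along the trajectory. A one-function sketch, assuming pi_prob(a, s) and b_prob(a, s) callables:

```python
def importance_sampling_ratio(trajectory, pi_prob, b_prob):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k) over a trajectory.

    `trajectory` is assumed to be a list of (state, action) pairs from time t
    to T-1; the transition probabilities p(s'|s, a) cancel and are not needed.
    """
    ratio = 1.0
    for state, action in trajectory:
        ratio *= pi_prob(action, state) / b_prob(action, state)
    return ratio
```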
Ordinary vs. Weighted Importance Sampling
• Ordinary importance sampling:
    V(s) ≐ (1/n) Σ_{i=1}^n ρ_{t(s,i):T(i)−1} G_{i,t(s,i)}
• Weighted importance sampling:
    V(s) ≐ ( Σ_{i=1}^n ρ_{t(s,i):T(i)−1} G_{i,t(s,i)} ) / ( Σ_{i=1}^n ρ_{t(s,i):T(i)−1} )
  where t(s, i) is the time of the first visit to s in episode i and T(i) is the terminal time of episode i.
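A sketch of both estimators for a single state s, where data is an assumed list of per-episode (ρ, G) pairs for episodes that visited s:

```python
def ordinary_is_estimate(data):
    """Ordinary importance sampling: unbiased, but can have very high variance."""
    n = len(data)
    return sum(rho * g for rho, g in data) / n if n else 0.0

def weighted_is_estimate(data):
    """Weighted importance sampling: biased (bias shrinks to zero), lower variance."""
    denom = sum(rho for rho, _ in data)
    return sum(rho * g for rho, g in data) / denom if denom else 0.0

# Example: three episodes that visited s, with their ratios and returns
data = [(0.5, 2.0), (3.0, 1.0), (0.0, 4.0)]
print(ordinary_is_estimate(data), weighted_is_estimate(data))
```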
Example: Ordinary vs. Weighted Importance Sampling for Blackjack
[Figure: mean square error (average over 100 runs) vs. episodes (log scale), for ordinary and weighted importance sampling]
Figure 5.3: Weighted importance sampling produces lower error estimates of the value of a single blackjack state from off-policy episodes. (Image: Sutton & Barto, 2018)
Off-Policy Monte Carlo Prediction

Off-policy MC prediction (policy evaluation), for estimating Q ≈ q_π
  Input: an arbitrary target policy π
  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ∈ ℝ (arbitrarily)
    C(s, a) ← 0
  Loop forever (for each episode):
    b ← any policy with coverage of π
    Generate an episode following b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    W ← 1
    Loop for each step of episode, t = T−1, T−2, ..., 0, while W ≠ 0:
      G ← γG + R_{t+1}
      C(S_t, A_t) ← C(S_t, A_t) + W
      Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
      W ← W · π(A_t|S_t) / b(A_t|S_t)
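A Python sketch of this incremental weighted-importance-sampling update, assuming a generate_episode() helper that rolls out one episode under b, plus pi_prob(a, s) and b_prob(a, s) callables:

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_episode, pi_prob, b_prob,
                             gamma=1.0, num_episodes=10_000):
    """Off-policy MC prediction with incremental weighted importance sampling.

    `generate_episode()` is assumed to return one episode generated by the
    behaviour policy b as [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)].
    """
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative sum of weights for each (s, a)

    for _ in range(num_episodes):
        episode = generate_episode()
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi_prob(a, s) / b_prob(a, s)
            if W == 0.0:      # all earlier steps would get zero weight; stop early
                break
    return Q
```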
Off-Policy Monte Carlo Control

Off-policy MC control, for estimating π ≈ π_*
  Initialize, for all s ∈ S, a ∈ A(s):
    Q(s, a) ∈ ℝ (arbitrarily)
    C(s, a) ← 0
    π(s) ← argmax_a Q(s, a)  (with ties broken consistently)
  Loop forever (for each episode):
    b ← any soft policy
    Generate an episode using b: S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    W ← 1
    Loop for each step of episode, t = T−1, T−2, ..., 0:
      G ← γG + R_{t+1}
      C(S_t, A_t) ← C(S_t, A_t) + W
      Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
      π(S_t) ← argmax_a Q(S_t, a)  (with ties broken consistently)
      If A_t ≠ π(S_t) then exit inner Loop (proceed to next episode)
      W ← W · 1 / b(A_t|S_t)
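A matching sketch of the control variant under the same assumed helpers; the target policy is greedy with respect to Q, so π(A_t|S_t) is 1 for the greedy action and 0 otherwise:

```python
from collections import defaultdict

def off_policy_mc_control(generate_episode, actions, b_prob,
                          gamma=1.0, num_episodes=10_000):
    """Off-policy MC control sketch: greedy target policy, soft behaviour policy b.

    `generate_episode()` is assumed to roll out one episode under b and return
    [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)]; `b_prob(a, s)` gives b's
    action probabilities; `actions(s)` lists the legal actions in s.
    """
    Q = defaultdict(float)
    C = defaultdict(float)
    target = {}   # greedy target policy pi

    for _ in range(num_episodes):
        episode = generate_episode()
        G, W = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            target[s] = max(actions(s), key=lambda x: Q[(s, x)])  # greedy improvement
            if a != target[s]:
                break                 # pi(a|s) = 0, so all earlier weights would be 0
            W *= 1.0 / b_prob(a, s)   # pi(a|s) = 1 for the greedy action
    return target, Q
```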