Reinforcement Learning
UMaine COS 470/570 – Introduction to AI
Spring 2019
Created: 2019-04-23 Tue 13:56

Why reinforcement learning?

- Supervised learning: need labeled examples
- Unsupervised learning: maybe learn structure, but…
- Often:
  - Do not have labeled examples
  - Have to do something – i.e., make some decision – before training is complete
  - E.g., games
  - But have some feedback about how the agent is doing

Framing the problem

- Reinforcement of agent's actions via rewards
- Current state → choose action → new state + reward
- Let R(s) = reward for state s
- Many states may have 0 reward (see the sketch below):
  - $s_0 \rightarrow a_1 \rightarrow s_1 \rightarrow a_2 \rightarrow \cdots \rightarrow a_n \rightarrow s_n$
  - $R(s_0) = R(s_1) = \cdots = R(s_{n-1}) = 0$
- Instance of the credit assignment problem
- Instance of a sequential decision problem
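A minimal sketch of the kind of sparse-reward trajectory just described; the state names, actions, and reward values are illustrative only, not from the course.

```python
# Hypothetical episode: (state, action taken, reward received in that state).
# Every state has reward 0 except the terminal one, so the agent must decide
# which earlier actions deserve credit for the final outcome (the credit
# assignment problem).
episode = [
    ("s0", "a1", 0.0),
    ("s1", "a2", 0.0),
    ("s2", "a3", 0.0),
    ("s3", None, 1.0),   # terminal state, e.g. "you win"
]

total_reward = sum(r for _, _, r in episode)
print(total_reward)  # 1.0 -- the only feedback the agent gets
```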
Reinforcement learning

- Rewards (figure from https://icml.cc/2016/tutorials/deep_rl_tutorial.pdf)

Reinforcement learning

- But no a priori knowledge of rewards or of the model (transition function)
- E.g.:
  - Given an unfamiliar board and pieces, alternate moves with an opponent – only feedback is "you win" or "you lose"
  - Robot has to move around campus delivering mail, but doesn't know anything about campus, or delivering mail, or people, or… feedback: "good robot", "ouch!", falls over, etc.

Learning approaches

- Learn utilities of states U(s)
  - Use to select the action that maximizes expected outcome utility
  - Needs a model of the environment, though, to know the resulting state s′ from taking action a in s
- Policy learning (reflex agent):
  - Directly learn π(s): which action to take in s, bypassing U(s)
- Q-learning:
  - Learn an action-utility function Q
  - Q(a, s) is the value (utility) of action a in state s
  - Model-less learning
- (These three representations are sketched below.)

Learning approaches

- Passive learning:
  - Policy is fixed
  - Task: learn U(s) (or the utility of state–action pairs)
  - Maybe learn a model
- Active learning:
  - Has to learn what to do
  - May not even know what its actions do
  - Involves exploration
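A rough sketch of what each of the three approaches above would actually store, for a made-up two-state, two-action world; all names and numbers here are my own illustration, not the course's.

```python
# Hypothetical tiny world: states "A", "B"; actions "left", "right".

# 1. Utility learning: one value per state. Acting on U still requires a
#    transition model P(s' | s, a) to predict where each action leads.
U = {"A": 0.5, "B": 0.8}

# 2. Policy learning: map each state directly to an action, bypassing U.
pi = {"A": "right", "B": "right"}

# 3. Q-learning: one value per (action, state) pair; acting greedily needs
#    no transition model ("model-less learning").
Q = {("left", "A"): 0.3, ("right", "A"): 0.5,
     ("left", "B"): 0.4, ("right", "B"): 0.8}

def greedy_action(Q, s, actions=("left", "right")):
    """Pick the action with the highest Q-value in state s."""
    return max(actions, key=lambda a: Q[(a, s)])

print(greedy_action(Q, "A"))  # "right"
```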
Passive reinforcement learning

- Policy π(s) is fixed
- Task: see how good the policy is by learning
  $U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$
- Doesn't know:
  - transition model P(s′ | s, a)
  - reward function R(s)
- Approach:
  - Do a series of trials (sketched below)
  - Each: start at the start state, follow the policy to a terminal state
  - Percepts ⇒ new state s′, R(s′)
  - Stochastic transitions ⇒ different histories from the same π
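The trial loop above might look roughly like this. The env.reset()/env.step() interface is a hypothetical simulator API (the slides do not define one), and GAMMA is an assumed discount factor.

```python
GAMMA = 0.9   # assumed discount factor

def run_trial(env, pi):
    """Follow the fixed policy pi from the start state to a terminal state,
    returning the (state, reward) pairs observed along the way."""
    s, r = env.reset()                 # start state and its reward
    history = [(s, r)]
    done = False
    while not done:
        s, r, done = env.step(pi[s])   # stochastic transition to a new state
        history.append((s, r))
    return history

def rewards_to_go(history, gamma=GAMMA):
    """Discounted reward-to-go from each visited state -- the per-trial sample
    that direct utility estimation (next slide) averages over trials."""
    g, out = 0.0, []
    for s, r in reversed(history):
        g = r + gamma * g
        out.append((s, g))
    return list(reversed(out))
```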
Direct estimation of U^π(s)

- Widrow & Hoff (1960) – adaptive control theory
- U(s) = remaining reward = reward-to-go
- View: each trial ⇒ one sample of the reward-to-go for each visited state
- Reduces reinforcement learning to supervised learning
- But although R(s) and R(s′) are independent…
- …U(s) and U(s′) are not independent (cf. the Bellman equation)
- Misses opportunities for learning – e.g.:
  - See s₁ for the first time; it leads to a state s₂ that is already known
  - Bellman: U(s₂) tells us something about U(s₁)
  - Direct estimation: only R(s₁) matters
- Hypothesis space is larger than it needs to be

Adaptive dynamic programming

- First learn a model of the transition function P(s′ | s, a) from trials
- Now you have an MDP
- Solve it as a sequential decision process
- Could use Bayesian approaches to make this better (see R&N 21.2.2)

Temporal difference learning

- Use the Bellman equation directly:
  $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^\pi(s')$
- General idea:
  - Start with no known U(·)
  - Iterate:
    - Take step π(s) to give s′
    - If s′ is an unknown state, use R(s′) as U(s′)
    - Use U(s′) to adjust U(s):
      $U^\pi(s) \leftarrow U^\pi(s) + \alpha\big(R(s) + \gamma U^\pi(s') - U^\pi(s)\big)$

Temporal difference RL algorithm

- (The slide presents the algorithm as a figure; a rough sketch follows below.)
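A minimal sketch of the TD update above as a passive learner, reusing the hypothetical env.reset()/env.step() interface from the earlier sketch; ALPHA and GAMMA are assumed values, since the slides give the update rule but not the constants.

```python
ALPHA = 0.1   # assumed learning rate
GAMMA = 0.9   # assumed discount factor

def td_policy_evaluation(env, pi, n_trials=1000, alpha=ALPHA, gamma=GAMMA):
    """Estimate U^pi(s) with the temporal-difference update under fixed pi."""
    U = {}                              # utilities of states seen so far
    for _ in range(n_trials):
        s, r = env.reset()
        U.setdefault(s, r)              # first visit: U(s) <- R(s)
        done = False
        while not done:
            s2, r2, done = env.step(pi[s])
            U.setdefault(s2, r2)        # unknown successor: use R(s') as U(s')
            # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
            U[s] += alpha * (r + gamma * U[s2] - U[s])
            s, r = s2, r2
    return U
```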
Active reinforcement learning

- What if we not only don't know:
  - the transition model P(s′ | s, a)
  - the reward function R(s)
- …but also don't know π(s)?
- One approach: use passive learning, but for all possible actions a ∈ A(s) at each state
  - Use the adaptive dynamic programming agent, but for all a ∈ A(s)
  - This gives the transition model
  - Use value iteration or policy iteration ⇒ U(s) (sketched below)
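One way the "ADP for all actions" idea could be realized is sketched here; the random exploration policy, the env interface, and all constants are my assumptions, and a real active learner would want a smarter exploration strategy than pure random choice.

```python
import random
from collections import defaultdict

GAMMA = 0.9   # assumed discount factor

def explore(env, actions, n_trials=1000):
    """Try all actions (here: chosen at random) to estimate P(s' | s, a) and R(s)."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s']
    R = {}
    for _ in range(n_trials):
        s, r = env.reset()
        R[s] = r
        done = False
        while not done:
            a = random.choice(actions)
            s2, r2, done = env.step(a)
            counts[(s, a)][s2] += 1
            R[s2] = r2
            s = s2
    P = {sa: {s2: n / sum(dests.values()) for s2, n in dests.items()}
         for sa, dests in counts.items()}
    return P, R

def value_iteration(P, R, actions, gamma=GAMMA, n_iters=100):
    """Solve the learned MDP for U(s) by repeated Bellman backups."""
    U = {s: 0.0 for s in R}
    for _ in range(n_iters):
        U = {s: R[s] + gamma * max(
                 sum(p * U[s2] for s2, p in P.get((s, a), {}).items())
                 for a in actions)
             for s in R}
    return U
```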