Reinforcement Learning
UMaine COS 470/570 Introduction to AI
Spring 2019

Why reinforcement learning?
- Supervised learning: need labeled examples
- Unsupervised learning: maybe learn structure, but…
- Often:
  - Do not have labeled examples
  - Have to do something (i.e., make some decision) before training is complete
  - E.g., games
  - But have some feedback about how agent is doing

Framing the problem
- Reinforcement of agent's actions via rewards
- Current state → choose action → new state + reward
- Let $R(s)$ = reward for state $s$
- Many states may have 0 reward:
  $s_0 \to a_1 \to s_1 \to a_2 \to \cdots \to a_n \to s_n$
  $R(s_0) = R(s_1) = \cdots = R(s_{n-1}) = 0$
- Instance of credit assignment problem
- Instance of sequential decision problem

Reinforcement learning
- Rewards
- But no a priori knowledge of rewards, model (transition function)
- E.g.:
  - Given an unfamiliar board and pieces, alternate moves with opponent; only feedback is "you win" or "you lose"
  - Robot has to move around campus delivering mail, but doesn't know anything about campus, or delivering mail, or people, or…
    feedback: "good robot", "ouch!", falls over, etc.

Learning approaches
- Passive learning:
  - Policy is fixed
  - Task: learn $U(s)$ (or utility of state-action pairs)
  - Maybe learn model
- Active learning:
  - Has to learn what to do
  - May not even know what its actions do
  - Involves exploration

Learning approaches
- Learn utilities of states
  - Use to select action to maximize expected outcome utility
  - Needs model of environment, though, to know $s'$ resulting from taking action $a$ in $s$
- Policy learning (reflex agent):
  - Directly learn $\pi(s)$: which action to take in $s$, bypassing $U(s)$
- Q-learning:
  - Learn an action-utility function $Q$
  - $Q(a, s)$ is the value (utility) of action $a$ in state $s$
  - Model-less learning
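The difference between learning state utilities and learning an action-utility function can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary layout for the model `P`, the utility table `U`, and the table `Q` (keyed by `(a, s)` to match $Q(a, s)$) are all assumptions made for the example, not something the slides specify.

```python
def best_action_from_utilities(s, actions, P, U):
    """Utility-based choice: needs the transition model P to look one step ahead.

    P maps (state, action) to a dict {next_state: probability} (assumed layout).
    U maps each state to its utility.
    """
    def expected_utility(a):
        return sum(p * U[s2] for s2, p in P[(s, a)].items())
    return max(actions, key=expected_utility)


def best_action_from_q(s, actions, Q):
    """Q-based choice: compares stored action values directly, no model needed."""
    return max(actions, key=lambda a: Q[(a, s)])


# Hypothetical two-action example.
P = {("s0", "left"): {"s1": 0.8, "s2": 0.2},
     ("s0", "right"): {"s1": 0.1, "s2": 0.9}}
U = {"s1": 1.0, "s2": -1.0}
Q = {("left", "s0"): 0.6, ("right", "s0"): -0.8}

print(best_action_from_utilities("s0", ["left", "right"], P, U))
print(best_action_from_q("s0", ["left", "right"], Q))
```

The first function cannot run without `P`, which is the sense in which utility learning "needs a model"; the second only reads the table it has learned.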

Passive reinforcement learning
- Policy $\pi(s)$ is fixed
- Task: see how good the policy is by learning:
  $U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right]$
- Doesn't know:
  - transition model $P(s' \mid s, a)$
  - reward function $R(s)$
- Approach:
  - Do series of trials
  - Each: start at start, follow policy to terminal state
  - Percepts ⇒ new state $s'$, $R(s')$
  - Stochastic transitions ⇒ different histories from same $\pi$
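As a rough illustration of the quantity being learned, the discounted sum $\sum_t \gamma^t R(s_t)$ for a single trial can be computed directly, and the expectation can be approximated by averaging over trials. The function names and the representation of a trial as a plain list of rewards are assumptions made for this sketch.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R(s_t) over one trial's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_start_utility(trials, gamma):
    """Average the discounted return over several trials from the same start state."""
    returns = [discounted_return(rewards, gamma) for rewards in trials]
    return sum(returns) / len(returns)


# Hypothetical trials: each is the list of rewards seen while following the fixed policy.
trials = [[-0.04, -0.04, 1.0],
          [-0.04, -0.04, -0.04, 1.0]]
print(estimate_start_utility(trials, gamma=0.9))
```

Because transitions are stochastic, different trials under the same policy give different returns, which is why the estimate averages over them.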

Direct estimation of $U^\pi(s)$
- Widrow & Hoff (1960), adaptive control theory
- $U(s)$ = remaining reward = reward-to-go
- View: each trial ⇒ one sample of reward-to-go for each visited state
- Reduces reinforcement learning to supervised learning
- But although $R(s)$ and $R(s')$ are independent…
- … $U(s)$ and $U(s')$ are not independent (cf. Bellman equation)
- Misses opportunities for learning, e.g.:
  - See $s_1$ for first time, it leads to known state $s_2$
  - Bellman: $U(s_2)$ tells us something about $U(s_1)$
  - Direct estimation: only $R(s_1)$ matters
- Hypothesis space is larger than it needs to be

Adaptive dynamic programming
- First learn model of transition function $P(s' \mid s, a)$ from trials
- Now you have an MDP
- Solve it as per sequential decision process
- Could use Bayesian approaches to make this better (see R&N, 21.2.2)

Temporal difference learning
- Use the Bellman equations directly:
  $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^\pi(s')$

Temporal difference RL algorithm
- General idea:
  - Start with no known $U(\cdot)$
  - Iterate:
    - Take step $\pi(s)$ to give $s'$
    - If $s'$ is unknown state, use $R(s')$ as $U(s')$
    - Use $U(s')$ to adjust $U(s)$:
      $U^\pi(s) \leftarrow U^\pi(s) + \alpha \left( R(s) + \gamma U^\pi(s') - U^\pi(s) \right)$
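The temporal difference update can be sketched as follows. This is a minimal illustration rather than a full agent: the function names and the representation of a trial as a list of `(state, reward)` pairs are assumptions, and initializing the very first state of a trial with its own reward is a choice made here so the update has a value to adjust.

```python
def td_update(U, s, r, s_next, r_next, alpha, gamma):
    """Apply one TD update to U[s] after observing the step s -> s_next.

    r is R(s), r_next is R(s_next). Unknown states are initialized to their reward.
    """
    U.setdefault(s, r)            # assumption: first state of a trial starts at R(s)
    U.setdefault(s_next, r_next)  # "if s' is unknown, use R(s') as U(s')"
    U[s] += alpha * (r + gamma * U[s_next] - U[s])


def run_trial(U, trial, alpha, gamma):
    """Walk one trial, given as [(state, reward), ...], updating U in place."""
    for (s, r), (s_next, r_next) in zip(trial, trial[1:]):
        td_update(U, s, r, s_next, r_next, alpha, gamma)


# Hypothetical three-state trial ending in a rewarding terminal state.
U = {}
run_trial(U, [("A", -0.04), ("B", -0.04), ("C", 1.0)], alpha=0.5, gamma=1.0)
print(U)
```

Note that no transition model appears anywhere: the observed successor $s'$ stands in for the expectation over $P(s' \mid s, \pi(s))$, and repeated trials average out the sampling noise.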

Active reinforcement learning
- What if we not only don't know:
  - $P(s' \mid s, a)$
  - $R(s)$
- …but also don't know $\pi(s)$?
- One approach: use passive learning, but for all possible actions
  - Use the adaptive dynamic programming agent, but for all $a \in A(s)$ at each state
  - This gives the transition model
  - Use value iteration or policy iteration ⇒ $U(s)$
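One way to picture this approach is a two-step sketch: estimate $P(s' \mid s, a)$ from observed transitions by counting, then run value iteration on the estimated model. The function names, the `(s, a, s')` triple format, the fixed number of sweeps, and treating a state with no recorded actions as terminal are all assumptions of this example; how the agent chooses which actions to try (exploration) is not shown.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Maximum-likelihood estimate of P(s' | s, a) from observed (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return model


def value_iteration(states, actions, P, R, gamma=0.9, sweeps=100):
    """Repeated Bellman updates U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_U = {}
        for s in states:
            # Expected utility of each action that has been tried in s.
            q = [sum(p * U[s2] for s2, p in P[(s, a)].items())
                 for a in actions if (s, a) in P]
            # Assumption: states with no recorded actions are treated as terminal.
            new_U[s] = R[s] + gamma * max(q, default=0.0)
        U = new_U
    return U


# Hypothetical experience: "go" reaches s1 three times out of four.
observed = [("s0", "go", "s1")] * 3 + [("s0", "go", "s0")] + [("s0", "stay", "s0")] * 2
P = estimate_model(observed)
U = value_iteration(["s0", "s1"], ["go", "stay"], P, R={"s0": 0.0, "s1": 1.0})
print(P[("s0", "go")], U)
```

The estimated model is only as good as the experience behind it, so an agent that always acts greedily on an early estimate may never try the actions that would correct it.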


