  1. Value Iteration 3-21-16

  2. Reading Quiz
     The Q function learned by Q-learning maps ________ to ________.
     a) state → action
     b) state → (action, expected reward)
     c) action → expected reward
     d) (state, action) → expected reward
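Since the Q function maps a (state, action) pair to an expected reward, one minimal way to picture it is as a table keyed by (state, action) pairs. The states, actions, and numbers below are invented purely for illustration:

    # Hypothetical Q-table: keys are (state, action) pairs, values are learned
    # expected rewards. All entries here are made up for illustration.
    Q = {
        ((0, 0), "up"):    0.55,
        ((0, 0), "right"): 0.45,
        ((2, 2), "right"): 0.87,
    }

    def best_action(Q, state, actions):
        """Greedy choice: the action with the highest Q-value in this state."""
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

    print(best_action(Q, (0, 0), ["up", "right"]))   # -> 'up'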

  3. Reinforcement learning setting
     ● We are trying to learn a policy that maps states to actions.
       ○ The state may be fully or partially observed.
         ■ We will focus on the fully-observable case.
       ○ Actions can have non-deterministic outcomes.
         ■ Transition probabilities are often unknown.
     ● Semi-supervised: we have partial information about this mapping.
     ● The agent receives occasional feedback in the form of rewards.

  4. Reinforcement learning vs. other machine learning
     Supervised
     ● Output known for training set
     ● Highly flexible; can learn many agent components
     ● Algorithms: linear least squares, decision trees, Naive Bayes, K-nearest neighbors, SVM
     Semi-Supervised
     ● Occasional feedback
     ● Learn the agent function (policy learning)
     ● Algorithms: value iteration, Q-learning, MCTS
     Unsupervised
     ● No feedback
     ● Learn representations
     ● Algorithms: K-means (clustering), PCA (dimensionality reduction)

  5. Reinforcement learning vs. state space search
     Search
     ● State is fully known.
     ● Actions are deterministic.
     ● Want to find a goal state.
       ○ Finite horizon.
     ● Come up with a plan to reach a goal state.
     RL
     ● State is fully known.
     ● Actions have random outcomes.
     ● Want to maximize reward.
       ○ Infinite horizon.
     ● Come up with a policy for what to do in each state.

  6. A simple example: Grid World
     ● If actions were deterministic, we could solve this with state space search.
     ● (3,2) would be a goal state.
     ● (3,1) would be a dead end.
     [Grid World figure: a 4×3 grid with columns x = 0-3 and rows y = 0-2; the start is at (0,0), and the two end states are +1 at (3,2) and -1 at (3,1).]

  7. A simple example: Grid World
     ● Suppose instead that moves have a 0.8 chance of succeeding.
     ● With probability 0.1, the agent goes in each perpendicular direction.
       ○ If impossible, stay in place.
     ● Now any given plan may not succeed.
     [Grid World figure: the same 4×3 grid with start at (0,0), +1 end at (3,2), and -1 end at (3,1).]
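A minimal sketch of this transition model in Python, assuming the 4×3 grid from the previous slide, terminal cells at (3,2) and (3,1), and a blocked cell at (1,1) (inferred from the value grids on the later slides); the names `step` and `transitions` are mine, not the course's:

    # 0.8 chance the intended move succeeds; 0.1 chance of slipping to each
    # perpendicular direction; blocked or off-grid moves leave the agent in place.
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                     "left": ("up", "down"), "right": ("up", "down")}
    WALLS = {(1, 1)}          # assumed blocked cell
    WIDTH, HEIGHT = 4, 3      # columns 0-3, rows 0-2

    def step(state, direction):
        """Deterministic result of one move attempt; stay put if blocked."""
        dx, dy = MOVES[direction]
        x, y = state[0] + dx, state[1] + dy
        if (x, y) in WALLS or not (0 <= x < WIDTH and 0 <= y < HEIGHT):
            return state
        return (x, y)

    def transitions(state, action):
        """Return {next_state: probability} for taking `action` in `state`."""
        probs = {}
        for direction, p in [(action, 0.8),
                             (PERPENDICULAR[action][0], 0.1),
                             (PERPENDICULAR[action][1], 0.1)]:
            nxt = step(state, direction)
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs

    print(transitions((0, 0), "up"))   # {(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}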

  8. Value Iteration
     values = {each state : 0}
     loop ITERATIONS times:
         previous = copy of values
         for all states:
             EVs = {each legal action : 0}
             for all legal actions:
                 for each possible next_state:
                     EVs[action] += prob * previous[next_state]
             values[state] = reward(state) + discount * max(EVs)
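A runnable version of this loop, reusing the `transitions` sketch above; the function name, the 100-iteration cap, holding terminal cells fixed at their rewards, and a living reward of 0 are my own assumptions, but they are consistent with the value grids on the next slides:

    def value_iteration(states, actions, transitions, reward,
                        terminals, discount=0.9, iterations=100):
        """Repeatedly back up each state's value from the previous sweep's values."""
        values = {s: 0.0 for s in states}
        for _ in range(iterations):
            previous = dict(values)
            for state in states:
                if state in terminals:
                    values[state] = reward(state)   # terminal values stay at +1 / -1
                    continue
                evs = [sum(p * previous[nxt]
                           for nxt, p in transitions(state, action).items())
                       for action in actions]
                values[state] = reward(state) + discount * max(evs)
        return values

    # Example use with the Grid World pieces defined earlier:
    states = [(x, y) for x in range(4) for y in range(3) if (x, y) not in WALLS]
    terminals = {(3, 2), (3, 1)}
    reward = lambda s: {(3, 2): 1.0, (3, 1): -1.0}.get(s, 0.0)
    V = value_iteration(states, list(MOVES), transitions, reward, terminals)
    print(round(V[(2, 2)], 2))   # close to the .85 shown on slide 11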

  9. Exercise: continue carrying out value iteration
     discount = .9
     [Value grid after the first updates:
        y=2:   0     0    .72    +1
        y=1:   0   (wall)   0    -1
        y=0:   0     0      0     0
              x=0   x=1    x=2   x=3 ]

  10. Exercise: continue carrying out value iteration
      discount = .9
      [Value grid after the second updates:
         y=2:   0    .52    .78    +1
         y=1:   0   (wall)  .43    -1
         y=0:   0     0      0      0
               x=0   x=1    x=2    x=3 ]
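To see where these numbers come from, here is a small arithmetic check of the second sweep, assuming the 0.8/0.1/0.1 model, discount .9, a living reward of 0, and the blocked cell at (1,1); only (2,2) was nonzero after the first sweep:

    discount = 0.9
    # Values from the previous sweep (slide 9); every other cell is 0.
    prev = {(2, 2): 0.72, (3, 2): 1.0, (3, 1): -1.0}
    V = lambda s: prev.get(s, 0.0)

    # (2,2), action right: 0.8 to the +1 exit, 0.1 slips up (off-grid, stay),
    # 0.1 slips down to (2,1).
    print(round(discount * (0.8 * V((3, 2)) + 0.1 * V((2, 2)) + 0.1 * V((2, 1))), 2))  # 0.78

    # (1,2), action right: 0.8 to (2,2); both slips are blocked (edge above, wall below).
    print(round(discount * (0.8 * V((2, 2)) + 0.2 * V((1, 2))), 2))                    # 0.52

    # (2,1), action up: 0.8 to (2,2), 0.1 slips left into the wall (stay),
    # 0.1 slips right into the -1 exit.
    print(round(discount * (0.8 * V((2, 2)) + 0.1 * V((2, 1)) + 0.1 * V((3, 1))), 2))  # 0.43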

  11. What do we do with the values?
      When values have converged, the optimal policy is to select the action with the highest expected value at each state.
      EV(u, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
      EV(r, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45
      [Converged value grid:
         y=2:  .64    .74    .85    +1
         y=1:  .57   (wall)  .57    -1
         y=0:  .49    .43    .48   .28
               x=0    x=1    x=2   x=3 ]
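A sketch of that policy-extraction step, reusing the hypothetical `transitions` function and the Grid World pieces from the sketches above; at (0,0) with converged values it reproduces the comparison on this slide (EV of up ≈ .548 beats EV of right ≈ .45, so the policy moves up):

    def extract_policy(states, actions, transitions, values, terminals):
        """In each non-terminal state, pick the action with the highest expected value."""
        policy = {}
        for state in states:
            if state in terminals:
                continue
            policy[state] = max(
                actions,
                key=lambda a: sum(p * values[nxt]
                                  for nxt, p in transitions(state, a).items()))
        return policy

    policy = extract_policy(states, list(MOVES), transitions, V, terminals)
    print(policy[(0, 0)])   # -> 'up'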

  12. What if we don’t know the transition probabilities?
      The only way to figure out the transition probabilities is to explore. We now need two things:
      ● A policy to use while exploring.
      ● A way to learn expected values without knowing exact transition probabilities.
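The slides stop here, but as one concrete illustration of "a policy to use while exploring": a common choice (my assumption, not something these slides specify) is ε-greedy, which usually exploits the current Q estimates but occasionally picks a random action:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon explore randomly; otherwise act greedily on Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))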
