Value Iteration 3-21-16
Reading Quiz

The Q function learned by Q-learning maps ________ to ________.
a) state → action
b) state → (action, expected reward)
c) action → expected reward
d) (state, action) → expected reward
Reinforcement learning setting
● We are trying to learn a policy that maps states to actions.
  ○ The state may be fully or partially observed.
    ■ We will focus on the fully-observable case.
  ○ Actions can have non-deterministic outcomes.
    ■ Transition probabilities are often unknown.
● Semi-supervised: we have partial information about this mapping.
● The agent receives occasional feedback in the form of rewards.
Reinforcement learning vs. other machine learning

Supervised
● Output known for training set.
● Highly flexible; can learn many agent components.
● Algorithms: Linear least squares, Decision trees, Naive Bayes, K-nearest neighbors, SVM.

Semi-Supervised (reinforcement learning)
● Occasional feedback.
● Learn the agent function (policy learning).
● Algorithms: value iteration, Q-learning, MCTS.

Unsupervised
● No feedback.
● Learn representations.
● Algorithms: K-means (clustering), PCA (dimensionality reduction).
Reinforcement learning vs. state space search

Search
● State is fully known.
● Actions are deterministic.
● Want to find a goal state.
  ○ Finite horizon.
● Come up with a plan to reach a goal state.

RL
● State is fully known.
● Actions have random outcomes.
● Want to maximize reward.
  ○ Infinite horizon.
● Come up with a policy for what to do in each state.
A simple example: Grid World
● If actions were deterministic, we could solve this with state space search.
● (3,2) would be a goal state.
● (3,1) would be a dead end.
[Figure: 4×3 Grid World with columns 0-3 and rows 0-2; start at (0,0), +1 terminal at (3,2), -1 terminal at (3,1).]
A simple example: Grid World
● Suppose instead that moves have a 0.8 chance of succeeding.
● With probability 0.1, the agent goes in each perpendicular direction.
  ○ If impossible, stay in place.
● Now any given plan may not succeed.
[Figure: the same 4×3 Grid World; start at (0,0), +1 terminal at (3,2), -1 terminal at (3,1).]
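The stochastic moves above can be written out as a small transition model. The following Python sketch is one possible encoding; the grid dimensions, the wall at (1,1), and the helper names (step, transition_probs) are assumptions made for illustration, not definitions from the slides.

# Sketch of the Grid World transition model: 0.8 in the intended direction,
# 0.1 in each perpendicular direction, stay in place if the move is blocked.
# Assumed layout: 4x3 grid, wall at (1,1), terminals at (3,2)=+1 and (3,1)=-1.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}
TERMINALS = {(3, 2): +1, (3, 1): -1}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    # Deterministic move; stay in place if blocked by a wall or the grid edge.
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return state
    return nxt

def transition_probs(state, action):
    # Returns {next_state: probability} for taking `action` in `state`.
    probs = {}
    for direction, p in [(action, 0.8), (PERPENDICULAR[action][0], 0.1),
                         (PERPENDICULAR[action][1], 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs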
Value Iteration

values = {each state : 0}
loop ITERATIONS times:
    previous = copy of values
    for all states:
        EVs = {each legal action : 0}
        for all legal actions:
            for each possible next_state:
                EVs[action] += prob * previous[next_state]
        values[state] = reward(state) + discount * max(EVs)
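Below is a runnable version of this pseudocode, reusing the hypothetical grid definitions and transition_probs helper from the earlier sketch. The reward function (terminal value at terminals, 0 elsewhere) and the handling of terminal states are assumptions consistent with the exercise grids that follow.

def reward(state):
    # Assumption: reward equals the terminal value at terminals, 0 elsewhere.
    return TERMINALS.get(state, 0)

def value_iteration(iterations=100, discount=0.9):
    states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
              if (x, y) not in WALLS]
    values = {s: 0.0 for s in states}
    for _ in range(iterations):
        previous = dict(values)
        for state in states:
            if state in TERMINALS:
                # Assumption: terminal states have no legal actions and
                # simply hold their reward.
                values[state] = reward(state)
                continue
            # Expected value of each action under the previous value estimates.
            EVs = {action: sum(prob * previous[nxt]
                               for nxt, prob in transition_probs(state, action).items())
                   for action in MOVES}
            values[state] = reward(state) + discount * max(EVs.values())
    return values

Under these assumptions, value_iteration()[(2, 2)] converges to roughly .85, matching the converged grid shown a few slides later.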
Exercise: continue carrying out value iteration

discount = .9

Values so far (the blank cell at (1,1) is a wall):
  row 2:   0      0     .72    +1
  row 1:   0    wall      0    -1
  row 0:   0      0       0     0
         col 0  col 1  col 2  col 3
Exercise: continue carrying out value iteration

discount = .9

Values after the next sweep:
  row 2:   0     .52    .78    +1
  row 1:   0    wall    .43    -1
  row 0:   0      0       0     0
         col 0  col 1  col 2  col 3
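As a worked step (assuming a reward of 0 in non-terminal states), the .78 at (2,2) follows from the previous grid:

EV(right, (2,2)) = .8*1 + .1*.72 + .1*0 = .872
    (the perpendicular move up is blocked by the grid edge, so the agent
     stays at (2,2), worth .72; down lands on (2,1), worth 0)
V(2,2) = 0 + .9 * .872 ≈ .78

The .52 at (1,2) and the .43 at (2,1) come from the same kind of computation.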
What do we do with the values?

When values have converged, the optimal policy is to select the action with the highest expected value at each state.

Converged values:
  row 2:  .64    .74    .85    +1
  row 1:  .57   wall    .57    -1
  row 0:  .49    .43    .48   .28
         col 0  col 1  col 2  col 3

EV(u, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
EV(r, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45
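In code, this policy extraction is a one-pass argmax over expected values. The sketch below reuses the hypothetical helpers from the earlier sketches:

def extract_policy(values):
    # Greedy policy: in each non-terminal state, pick the action whose
    # expected value under the converged values is highest.
    policy = {}
    for state in values:
        if state in TERMINALS:
            continue
        EVs = {action: sum(prob * values[nxt]
                           for nxt, prob in transition_probs(state, action).items())
               for action in MOVES}
        policy[state] = max(EVs, key=EVs.get)
    return policy

At (0,0) this picks up, since .548 > .45, matching the calculation above.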
What if we don’t know the transition probabilities?

The only way to figure out the transition probabilities is to explore. We now need two things:
● A policy to use while exploring.
● A way to learn expected values without knowing exact transition probabilities.
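One common pairing that supplies both pieces is epsilon-greedy exploration with the Q-learning update named in the reading quiz. The sketch below is illustrative only; the learning rate, epsilon, and the dictionary representation of Q are assumptions, not definitions from these slides.

import random
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated expected reward

def epsilon_greedy(state, actions, epsilon=0.1):
    # Exploration policy: random action with probability epsilon, else greedy.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, observed_reward, next_state, actions,
             alpha=0.1, discount=0.9):
    # Learn expected values from sampled transitions, with no transition model.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (observed_reward + discount * best_next
                                   - Q[(state, action)])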