Value Iteration 3-21-16
Reading Quiz

The Q function learned by Q-learning maps ________ to ________.
a) state → action
b) state → (action, expected reward)
c) action → expected reward
d) (state, action) → expected reward
Reinforcement learning setting
● We are trying to learn a policy that maps states to actions.
  ○ The state may be fully or partially observed.
    ■ We will focus on the fully-observable case.
  ○ Actions can have non-deterministic outcomes.
    ■ Transition probabilities are often unknown.
● Semi-supervised: we have partial information about this mapping.
● The agent receives occasional feedback in the form of rewards.
Reinforcement learning vs. other machine learning

Supervised
● Output known for training set.
● Highly flexible; can learn many agent components.
● Algorithms: Linear least squares, Decision trees, Naive Bayes, K-nearest neighbors, SVM.

Semi-Supervised (reinforcement learning)
● Occasional feedback.
● Learn the agent function (policy learning).
● Algorithms: value iteration, Q-learning, MCTS.

Unsupervised
● No feedback.
● Learn representations.
● Algorithms: K-means (clustering), PCA (dimensionality reduction).
Reinforcement learning vs. state space search

Search
● State is fully known.
● Actions are deterministic.
● Want to find a goal state.
  ○ Finite horizon.
● Come up with a plan to reach a goal state.

RL
● State is fully known.
● Actions have random outcomes.
● Want to maximize reward.
  ○ Infinite horizon.
● Come up with a policy for what to do in each state.
A simple example: Grid World
● If actions were deterministic, we could solve this with state space search.
● (3,2) would be a goal state.
● (3,1) would be a dead end.
[Figure: 4×3 Grid World with columns 0-3 and rows 0-2; start at (0,0), +1 terminal at (3,2), -1 terminal at (3,1).]
A simple example: Grid World
● Suppose instead that moves have a 0.8 chance of succeeding.
● With probability 0.1, the agent goes in each perpendicular direction.
  ○ If impossible, stay in place.
● Now any given plan may not succeed.
[Figure: the same 4×3 Grid World; start at (0,0), +1 terminal at (3,2), -1 terminal at (3,1).]
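The stochastic moves above can be written out as a small transition model. The following Python sketch is one possible encoding; the grid dimensions, the wall at (1,1), and the helper names (step, transition_probs) are assumptions made for illustration, not definitions from the slides.

# Sketch of the Grid World transition model: 0.8 in the intended direction,
# 0.1 in each perpendicular direction, stay in place if the move is blocked.
# Assumed layout: 4x3 grid, wall at (1,1), terminals at (3,2)=+1 and (3,1)=-1.
WIDTH, HEIGHT = 4, 3
WALLS = {(1, 1)}
TERMINALS = {(3, 2): +1, (3, 1): -1}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    # Deterministic move; stay in place if blocked by a wall or the grid edge.
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return state
    return nxt

def transition_probs(state, action):
    # Returns {next_state: probability} for taking `action` in `state`.
    probs = {}
    for direction, p in [(action, 0.8), (PERPENDICULAR[action][0], 0.1),
                         (PERPENDICULAR[action][1], 0.1)]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs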
Value Iteration

values = {each state : 0}
loop ITERATIONS times:
    previous = copy of values
    for all states:
        EVs = {each legal action : 0}
        for all legal actions:
            for each possible next_state:
                EVs[action] += prob * previous[next_state]
        values[state] = reward(state) + discount * max(EVs)
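Below is a runnable version of this pseudocode, reusing the hypothetical grid definitions and transition_probs helper from the earlier sketch. The reward function (terminal value at terminals, 0 elsewhere) and the handling of terminal states are assumptions consistent with the exercise grids that follow.

def reward(state):
    # Assumption: reward equals the terminal value at terminals, 0 elsewhere.
    return TERMINALS.get(state, 0)

def value_iteration(iterations=100, discount=0.9):
    states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
              if (x, y) not in WALLS]
    values = {s: 0.0 for s in states}
    for _ in range(iterations):
        previous = dict(values)
        for state in states:
            if state in TERMINALS:
                # Assumption: terminal states have no legal actions and
                # simply hold their reward.
                values[state] = reward(state)
                continue
            # Expected value of each action under the previous value estimates.
            EVs = {action: sum(prob * previous[nxt]
                               for nxt, prob in transition_probs(state, action).items())
                   for action in MOVES}
            values[state] = reward(state) + discount * max(EVs.values())
    return values

Under these assumptions, value_iteration()[(2, 2)] converges to roughly .85, matching the converged grid shown a few slides later.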
Exercise: continue carrying out value iteration

discount = .9

Values so far (the blank cell at (1,1) is a wall):
  row 2:   0      0     .72    +1
  row 1:   0    wall      0    -1
  row 0:   0      0       0     0
         col 0  col 1  col 2  col 3
Exercise: continue carrying out value iteration

discount = .9

Values after the next sweep:
  row 2:   0     .52    .78    +1
  row 1:   0    wall    .43    -1
  row 0:   0      0       0     0
         col 0  col 1  col 2  col 3
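As a worked step (assuming a reward of 0 in non-terminal states), the .78 at (2,2) follows from the previous grid:

EV(right, (2,2)) = .8*1 + .1*.72 + .1*0 = .872
    (the perpendicular move up is blocked by the grid edge, so the agent
     stays at (2,2), worth .72; down lands on (2,1), worth 0)
V(2,2) = 0 + .9 * .872 ≈ .78

The .52 at (1,2) and the .43 at (2,1) come from the same kind of computation.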
What do we do with the values?

When values have converged, the optimal policy is to select the action with the highest expected value at each state.

Converged values:
  row 2:  .64    .74    .85    +1
  row 1:  .57   wall    .57    -1
  row 0:  .49    .43    .48   .28
         col 0  col 1  col 2  col 3

EV(u, (0,0)) = .8*.57 + .1*.43 + .1*.49 = .548
EV(r, (0,0)) = .8*.43 + .1*.57 + .1*.49 = .45
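In code, this policy extraction is a one-pass argmax over expected values. The sketch below reuses the hypothetical helpers from the earlier sketches:

def extract_policy(values):
    # Greedy policy: in each non-terminal state, pick the action whose
    # expected value under the converged values is highest.
    policy = {}
    for state in values:
        if state in TERMINALS:
            continue
        EVs = {action: sum(prob * values[nxt]
                           for nxt, prob in transition_probs(state, action).items())
               for action in MOVES}
        policy[state] = max(EVs, key=EVs.get)
    return policy

At (0,0) this picks up, since .548 > .45, matching the calculation above.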
What if we don’t know the transition probabilities?

The only way to figure out the transition probabilities is to explore. We now need two things:
● A policy to use while exploring.
● A way to learn expected values without knowing exact transition probabilities.
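One common pairing that supplies both pieces is epsilon-greedy exploration with the Q-learning update named in the reading quiz. The sketch below is illustrative only; the learning rate, epsilon, and the dictionary representation of Q are assumptions, not definitions from these slides.

import random
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated expected reward

def epsilon_greedy(state, actions, epsilon=0.1):
    # Exploration policy: random action with probability epsilon, else greedy.
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, observed_reward, next_state, actions,
             alpha=0.1, discount=0.9):
    # Learn expected values from sampled transitions, with no transition model.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (observed_reward + discount * best_next
                                   - Q[(state, action)])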