Learning Predictive State Representations Using Non-Blind Policies
Michael Bowling, Peter McCracken, Michael James, James Neufeld, Dana Wilkinson
University of Alberta, Toyota Technical Center, University of Waterloo
ICML 2006
Outline
1. What is a PSR? (very brief tutorial)
2. Extracting PSRs from Data (short)
3. Prediction Estimators: Problem and Solution (the punchline)
4. Non-Blind Exploration (bonus)
Decision Process
Alternating actions and observations: a_1, o_1, a_2, o_2, ..., a_n, o_n
General form: Pr(o_{n+1} | a_1, o_1, ..., a_n, o_n, a_{n+1})
Markov decision process: Pr(o_{n+1} | a_1, o_1, ..., a_n, o_n, a_{n+1}) = Pr(o_{n+1} | o_n, a_{n+1})
Histories, Tests, and Predictions
Notation:
- History h: a_1, o_1, a_2, o_2, ..., a_n, o_n
- Test t: the same alternating form, but in the future.
- Prediction: p(t | h)

$$p(a_1 o_1 \ldots a_n o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(o_i \mid h a_1 o_1 \ldots a_i)$$

$$\pi(a_1 o_1 \ldots a_n o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(a_i \mid h a_1 o_1 \ldots a_{i-1} o_{i-1})$$

$$\Pr(t \mid h) = p(t \mid h)\,\pi(t \mid h)$$
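To make the factorization concrete, here is a minimal Python sketch. The toy system and policy (obs_prob, act_prob) are invented for illustration, not from the talk: a memoryless system whose observation echoes the action with probability 0.8, under a uniform blind policy.

```python
# A minimal sketch of the prediction/policy factorization above.

def obs_prob(history, action, obs):
    """Pr(obs | history, action) for a toy memoryless system:
    the observation echoes the action with probability 0.8."""
    return 0.8 if obs == action else 0.2

def act_prob(history, action):
    """Pr(action | history): a uniform (blind) policy over {0, 1}."""
    return 0.5

def p_test(test, history=()):
    """System part: p(t|h) = prod_i Pr(o_i | h a_1 o_1 ... a_i)."""
    prob, h = 1.0, tuple(history)
    for a, o in test:
        prob *= obs_prob(h, a, o)
        h += (a, o)
    return prob

def pi_test(test, history=()):
    """Policy part: pi(t|h) = prod_i Pr(a_i | h a_1 o_1 ... a_{i-1} o_{i-1})."""
    prob, h = 1.0, tuple(history)
    for a, o in test:
        prob *= act_prob(h, a)
        h += (a, o)
    return prob

test = [(0, 0), (1, 1)]              # a_1=0, o_1=0, a_2=1, o_2=1
print(p_test(test) * pi_test(test))  # Pr(t|h) = p(t|h) * pi(t|h) = 0.64 * 0.25
```

Note that p_test depends only on the system and pi_test only on the policy; this split is what the estimators later in the talk exploit.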
System Dynamics Matrix
- Countable number of tests and histories.
- An infinite matrix of all predictions: rows are histories h, columns are tests t, entries p(t | h).
POMDPs
- Underlying states s_1, s_2, s_3, s_4, ...
- Histories correspond to belief states b_1, b_2, b_3, b_4, ...
- Each history's row is a linear combination of the state rows.
- Therefore rank(SDM) ≤ |S|.
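The rank bound is easy to check numerically. The following sketch (an illustration, not from the talk) builds a random POMDP with three states, forms the finite block of the system dynamics matrix indexed by all length-two histories and tests, and confirms its rank is at most |S| = 3.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A, O = 3, 2, 2                                               # states, actions, obs

T = rng.random((A, S, S)); T /= T.sum(axis=2, keepdims=True)    # transitions
Z = rng.random((A, S, O)); Z /= Z.sum(axis=2, keepdims=True)    # observations
# M[a, o][s, s'] = Pr(s', o | s, a), the one-step dynamics matrix.
M = np.einsum('ast,ato->aost', T, Z)                            # shape (A, O, S, S)
b0 = np.ones(S) / S                                             # initial belief

def belief(history, b=b0):
    """Belief state after an alternating (a, o, a, o, ...) history."""
    for a, o in zip(history[::2], history[1::2]):
        b = b @ M[a, o]
        b = b / b.sum()
    return b

def p(test, b):
    """p(t | h), computed from the belief b for history h."""
    for a, o in zip(test[::2], test[1::2]):
        b = b @ M[a, o]
    return b.sum()

steps = list(itertools.product(range(A), range(O)))
seqs = [x + y for x in steps for y in steps]                    # all length-2 sequences
sdm = np.array([[p(t, belief(h)) for t in seqs] for h in seqs])
print(sdm.shape, np.linalg.matrix_rank(sdm))                    # rank <= |S| = 3
```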
Predictive State Representations
- Find linearly independent tests: the "core tests" Q = {q_1, q_2, q_3, ...}.
- Any test is a linear combination of the core tests:

$$p(t \mid h) = p(Q \mid h)\, m_t$$
- Update predictions after taking action a and observing o:

$$p(Q \mid hao) = \frac{p(aoQ \mid h)}{p(ao \mid h)} = \frac{p(Q \mid h)\, M_{aoQ}}{p(Q \mid h)\, m_{ao}}$$
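In code, this update is a constant-size linear operation on the prediction vector. The sketch below uses made-up parameters for a PSR with three core tests; p_Q, m_ao, and M_aoQ are illustrative placeholders, not learned values.

```python
import numpy as np

p_Q = np.array([0.5, 0.3, 0.7])      # p(Q | h): current core-test predictions
m_ao = np.array([0.4, 0.2, 0.1])     # weights for the one-step test ao
M_aoQ = np.array([[0.2, 0.1, 0.3],   # column j holds the weights m_{aoq_j}
                  [0.1, 0.4, 0.2],   # for the one-step extension of q_j
                  [0.3, 0.2, 0.1]])

def update(p_Q, m_ao, M_aoQ):
    """p(Q | hao) = p(aoQ | h) / p(ao | h) = (p(Q|h) M_aoQ) / (p(Q|h) m_ao)."""
    return (p_Q @ M_aoQ) / (p_Q @ m_ao)

print(update(p_Q, m_ao, M_aoQ))      # the new state after taking a and seeing o
```

The state is just the vector p(Q | h), so the update never grows with the length of the history.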
Extracting PSRs from Data
What Data?
a_1, o_1, a_2, o_2, ..., a_n, o_n
How are the actions chosen?
- Unknown policy.
- Known policy.
- Controlled policy.
Note: existing algorithms require a particular control policy, either exhaustively trying history-test pairs or taking random actions.
Extracting PSRs from Data
(James & Singh, 2004; Rosencrantz et al., 2004; Wolfe et al., 2005; Wiewiora, 2005; McCracken & Bowling, 2006)
The common formula:
- Find core tests.
- Find update parameters.
Both steps estimate part of the system dynamics matrix, i.e., a subset of the predictions, using counts over the data:

$$\hat{p}_\bullet(t \mid h) = \frac{\#(h\, a_1 o_1 \ldots a_n o_n)}{\#(h\, a_1 \ldots a_n)}$$

where the numerator counts occurrences of h followed by the test's actions and observations, and the denominator counts occurrences of h followed by the test's actions alone.
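A sketch of this count-based estimator, for the empty history for simplicity. The data format is an assumption for illustration: each trajectory is a flat alternating tuple (a_1, o_1, a_2, o_2, ...).

```python
def naive_estimate(data, test):
    """p_hat(t | h = empty) = #(a_1 o_1 ... a_n o_n) / #(a_1 ... a_n)."""
    num = den = 0
    for traj in data:
        prefix = traj[:len(test)]
        if prefix[::2] == test[::2]:      # the test's actions were taken
            den += 1
            num += (prefix == test)       # ... and its observations were seen
    return num / den if den else float('nan')

data = [(0, 0, 1, 1), (0, 1, 1, 0), (0, 0, 1, 1)]   # three toy trajectories
print(naive_estimate(data, (0, 0, 1, 1)))           # 2 / 3
```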
Problem

$$E[\hat{p}_\bullet(t \mid h)] = p(t \mid h) \cdot \frac{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 o_1 \ldots a_{i-1} o_{i-1})}{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 \ldots a_{i-1})}$$

Definition: A policy is blind if actions are selected independently of preceding observations, i.e.,

$$\Pr(a_n \mid a_1, o_1, \ldots, a_{n-1}, o_{n-1}) = \Pr(a_n \mid a_1, \ldots, a_{n-1})$$

Observation: \hat{p}_\bullet(t | h) is an unbiased estimator of p(t | h) only if π is blind.
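A small simulation (invented for illustration, not from the paper) makes the bias concrete. The system echoes each action in its observation with probability 0.8, so the true prediction for the test (a_1=0, o_1=0, a_2=0, o_2=0) from the empty history is 0.8^2 = 0.64. A non-blind policy that copies o_1 into a_2 skews the naive estimator's denominator toward trajectories where o_1 came out a certain way:

```python
import random

random.seed(0)

def simulate():
    a1 = random.randrange(2)
    o1 = a1 if random.random() < 0.8 else 1 - a1
    a2 = o1                                   # non-blind: a_2 depends on o_1
    o2 = a2 if random.random() < 0.8 else 1 - a2
    return (a1, o1, a2, o2)

data = [simulate() for _ in range(100_000)]
test = (0, 0, 0, 0)                           # true p(t | empty) = 0.8**2 = 0.64

num = sum(tr == test for tr in data)
den = sum((tr[0], tr[2]) == (0, 0) for tr in data)
print(num / den)                              # naive estimate: ~0.80, biased

# Known-policy correction: pi(t|h) = Pr(a1=0) * Pr(a2=0 | a1=0, o1=0) = 0.5 * 1
print(num / len(data) / 0.5)                  # ~0.64, unbiased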
What Data? (revisited)
How are the actions chosen?
- Unknown policy.
- Known policy.
- Controlled policy.
Prediction Estimators
Policy is known:

$$\hat{p}^{\pi}(t \mid h) = \frac{\#(ht)}{\#(h)} \cdot \frac{1}{\pi(t \mid h)}$$

Policy is not known:

$$\hat{p}^{\pi\times}(t \mid h) = \prod_{i=1}^{n} \frac{\#(h\, a_1 o_1 \ldots a_i o_i)}{\#(h\, a_1 o_1 \ldots a_i)}$$

Theorem: \hat{p}^{\pi}(t | h) and \hat{p}^{\pi\times}(t | h) are both unbiased estimators of p(t | h).
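Sketches of both estimators for the empty history, in the same flat trajectory format as the simulation above. These are straightforward readings of the formulas, not the authors' code.

```python
def known_policy_estimate(data, test, pi_t):
    """p_hat^pi(t|h) = (#(ht) / #h) / pi(t|h); pi_t = pi(t|h) is supplied
    by whoever knows the policy."""
    return sum(tr[:len(test)] == test for tr in data) / len(data) / pi_t

def unknown_policy_estimate(data, test):
    """p_hat^{pi x}(t|h) = prod_i #(a_1 o_1 ... a_i o_i) / #(a_1 o_1 ... a_i)."""
    est = 1.0
    for i in range(2, len(test) + 1, 2):
        num = sum(tr[:i] == test[:i] for tr in data)             # ... a_i o_i
        den = sum(tr[:i - 1] == test[:i - 1] for tr in data)     # ... a_i
        est *= num / den if den else float('nan')
    return est

# With the simulated data from the previous sketch:
#   known_policy_estimate(data, (0, 0, 0, 0), 0.5)   -> ~0.64
#   unknown_policy_estimate(data, (0, 0, 0, 0))      -> ~0.64
```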
Exploration
Goal: choose actions to reduce error in the estimated system dynamics matrix.
Approach: add intelligent exploration to James & Singh's "reset" algorithm. Since \hat{p}^{\pi}(t | h) is an unbiased estimator, we want to take actions that reduce its variance. Solve as an optimization problem.
Estimator Variance

$$V\!\left[\hat{p}^{\pi}(t \mid h) \,\middle|\, \#h = n\right] = \frac{p(t \mid h) - p(t \mid h)^2\, \pi(t \mid h)}{n\, \pi(t \mid h)} \le \frac{1}{4 n\, \pi(t \mid h)^2}$$

$$E\!\left[\, V\!\left[\hat{p}^{\pi}(t \mid h) \,\middle|\, \#h = n\right] \,\middle|\, k \text{ trajectories} \right] \le \frac{1}{4 k\, p(h)\, \pi(h)\, \pi(t \mid h)^2}$$
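One way to recover the first line, assuming the reset setting in which, given #h = n visits to h, the count #ht is binomial with success probability Pr(t|h) = p(t|h)π(t|h) (writing p for p(t|h) and π for π(t|h)):

```latex
\begin{align*}
V\!\left[\hat{p}^{\pi}(t \mid h) \,\middle|\, \#h = n\right]
  &= V\!\left[\frac{\#ht}{n\,\pi}\right]
   = \frac{n\, p\pi\,(1 - p\pi)}{n^2 \pi^2}
   = \frac{p - p^2\pi}{n\,\pi} \\
  &\le \frac{1}{4n\,\pi^2}
   \qquad \text{since } x(1-x) \le \tfrac{1}{4} \text{ for } x = p\pi \in [0,1].
\end{align*}
```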
Exploration
Intuition: find the policy that maximizes the worst-case (over all predictions) bound on the root expected inverse variance.

Optimization problem:

Maximize: $\min_{h,t} \sqrt{\, v_{i-1}(h,t)^{-1} + 2\, k_i\, p(h)\, \pi(ht) \,}$

Subject to sequence-form constraints on π(ht):
1. π(φ) = 1,
2. for all h and o ∈ O: π(h) = Σ_a π(hao), and
3. for all h, a ∈ A, and {o, o'} ⊆ O: π(hao) = π(hao').
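Because the square root is monotone, maximizing the minimum of the quantity inside it is a linear program once the sequence-form variables are laid out. The sketch below solves a deliberately tiny instance, an assumption on my part rather than the paper's implementation: only length-one tests from the empty history, so the policy collapses to action probabilities x_a and the constraints reduce to sum_a x_a = 1.

```python
import numpy as np
from scipy.optimize import linprog

# Variables are [x_0, ..., x_{A-1}, z]; we maximize z subject to
# z <= v_inv[t] + 2 k x_{a(t)} for each one-step test t.
A_actions, k = 3, 1000.0
v_inv = np.array([50.0, 400.0, 10.0])   # v_{i-1}(h,t)^{-1}: current precision
                                        # of each test's estimate (made up)

c = np.zeros(A_actions + 1); c[-1] = -1.0            # maximize z
A_ub = np.zeros((A_actions, A_actions + 1))
for t in range(A_actions):                           # z - 2k x_t <= v_inv[t]
    A_ub[t, t] = -2 * k
    A_ub[t, -1] = 1.0
A_eq = np.ones((1, A_actions + 1)); A_eq[0, -1] = 0  # sum_a x_a = 1
res = linprog(c, A_ub=A_ub, b_ub=v_inv, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * A_actions + [(None, None)])
print(res.x[:-1])   # the resulting exploration policy over actions
```

The solution puts most probability on the action whose test currently has the least precise estimate (smallest v_inv), which is the intended exploration behavior.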
Results
[Figure: testing error (log scale, 1e-4 to 0.1) versus sample size on three domains: Tiger, Paint, and Float-reset, comparing non-blind and random exploration. Non-blind exploration reaches lower testing error with fewer samples in each domain.]
Summary
Contributions:
- Unbiased prediction estimators for non-blind policies.
- Variance analysis in the case of a known policy.
- The estimators used in "intelligent" exploration, which was shown to speed learning.
Future work:
- Better objective functions for exploration.
- Investigate when non-blind exploration proves helpful.
Questions?