Learning Predictive State Representations Using Non-Blind Policies

Michael Bowling, Peter McCracken, Michael James, James Neufeld, Dana Wilkinson. University of Alberta, Toyota Technical Center, University of Waterloo. ICML 2006.


  1. Learning Predictive State Representations Using Non-Blind Policies. Michael Bowling, Peter McCracken, Michael James, James Neufeld, Dana Wilkinson. University of Alberta, Toyota Technical Center, University of Waterloo. ICML 2006. [Bowling et al. PSRs and Non-Blind Policies ICML 2006 1 / 18]

  2. Outline. (1) What is a PSR? (2) Extracting PSRs from Data. (3) Prediction Estimators: Problem and Solution. (4) Non-Blind Exploration. Section labels on the slide: Very Brief Tutorial, Short Punchline, Bonus.

  3. Decision Process. Actions $a_i$ and observations $o_i$ alternate: $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$. General form: $\Pr(o_{n+1} \mid a_1, o_1, \ldots, a_n, o_n, a_{n+1})$.

  4. Decision Process. Actions $a_i$ and observations $o_i$ alternate: $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$. Markov decision process: $\Pr(o_{n+1} \mid a_1, o_1, \ldots, a_n, o_n, a_{n+1}) = \Pr(o_{n+1} \mid o_n, a_{n+1})$.

  6. Histories, Tests, and Predictions. Notation:
     History ($h$): $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$.
     Test ($t$): $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$ (but in the future).
     Prediction $p(t \mid h)$:
     $p(a_1, o_1, \ldots, a_n, o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(o_i \mid h a_1, o_1, \ldots, a_i)$
     $\pi(a_1, o_1, \ldots, a_n, o_n \mid h) \equiv \prod_{i=1}^{n} \Pr(a_i \mid h a_1, o_1, \ldots, a_{i-1}, o_{i-1})$
     $\Pr(t \mid h) = p(t \mid h)\, \pi(t \mid h)$
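
As a small worked case of the notation on this slide, take a two-step test $t = a_1 o_1 a_2 o_2$. Nothing new is introduced; the block only unrolls the two product definitions for $n = 2$.

```latex
\begin{align*}
p(t \mid h)   &= \Pr(o_1 \mid h a_1)\,\Pr(o_2 \mid h a_1 o_1 a_2) \\
\pi(t \mid h) &= \Pr(a_1 \mid h)\,\Pr(a_2 \mid h a_1 o_1) \\
\Pr(t \mid h) &= p(t \mid h)\,\pi(t \mid h)
\end{align*}
```

The first line collects only the observation probabilities (the system), the second only the action probabilities (the policy), and their product is the probability of the whole action-observation sequence following $h$.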

  7. System Dynamics Matrix. Countable number of tests and histories. Infinite matrix of all predictions: rows are histories $h$, columns are tests $t$, and each entry is $p(t \mid h)$.

  12. POMDPs. Underlying states $s_1, s_2, s_3, s_4$, each with its own row of test predictions. Histories correspond to belief states $(b_1, b_2, b_3, b_4)$. A history row is a linear combination of state rows. $\therefore \operatorname{rank}(\mathrm{SDM}) \le |S|$.
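
The claim that a history row is a linear combination of state rows can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2-state POMDP whose transition table `T`, observation table `Z`, test list, and belief vector are all made up for illustration; none of them come from the talk.

```python
# Hypothetical 2-state, 2-action, 2-observation POMDP (all numbers are illustrative).
# T[a][s][s2] = Pr(next state s2 | state s, action a)
# Z[a][s2][o] = Pr(observation o | action a, next state s2)
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
Z = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.6, 0.4], [0.1, 0.9]]]
N_STATES = 2


def predict(belief, test):
    """Return p(test | belief) for a test given as a list of (action, observation) pairs."""
    weights = list(belief)
    for a, o in test:
        # Push the (unnormalized) belief through the transition, then weight by the observation.
        weights = [
            sum(weights[s] * T[a][s][s2] for s in range(N_STATES)) * Z[a][s2][o]
            for s2 in range(N_STATES)
        ]
    return sum(weights)


def state_row(state, tests):
    """Row of predictions for a belief concentrated on one underlying state."""
    point_belief = [1.0 if s == state else 0.0 for s in range(N_STATES)]
    return [predict(point_belief, t) for t in tests]


tests = [[(0, 0)], [(1, 0)], [(0, 1), (1, 0)], [(1, 1), (0, 0), (0, 1)]]
belief = [0.25, 0.75]  # assumed belief state reached by some history

history_row = [predict(belief, t) for t in tests]
combined = [
    sum(belief[s] * state_row(s, tests)[j] for s in range(N_STATES))
    for j in range(len(tests))
]
print(history_row)
print(combined)
```

Both printed rows agree up to floating-point error, because `predict` is linear in the belief vector. Every history row therefore lies in the span of the $|S|$ state rows, which is the rank bound on the slide.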

  15. Predictive State Representations. Find linearly independent tests $q_1, q_2, q_3$: the "core tests" $Q$. Any test $t$ is a linear combination of core tests: $p(t \mid h) = p(Q \mid h)\, m_t$.

  16. Predictive State Representations. Find linearly independent tests $q_1, q_2, q_3$: the "core tests" $Q$. Update predictions: $p(Q \mid hao) = \frac{p(aoQ \mid h)}{p(ao \mid h)} = \frac{p(Q \mid h)\, M_{aoQ}}{p(Q \mid h)\, m_{ao}}$.
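
The update rule on this slide can be exercised on a toy system. The sketch below assumes a hypothetical 2-state POMDP (tables `T` and `Z`) and an assumed choice of two core tests. It derives the weight vectors $m_t$ by solving a 2x2 linear system and then compares the PSR update against a direct belief-state update. All names and numbers are illustrative, not taken from the talk.

```python
# Hypothetical 2-state POMDP, used only to generate valid PSR parameters.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]   # T[a][s][s2]
Z = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.6, 0.4], [0.1, 0.9]]]   # Z[a][s2][o]
N = 2
CORE_TESTS = [[(0, 0)], [(1, 0)]]  # assumed core tests: q1 = a0 o0, q2 = a1 o0


def step(weights, a, o):
    """One unnormalized belief step for action a and observation o."""
    return [sum(weights[s] * T[a][s][s2] for s in range(N)) * Z[a][s2][o]
            for s2 in range(N)]


def predict(belief, test):
    weights = list(belief)
    for a, o in test:
        weights = step(weights, a, o)
    return sum(weights)


def state_column(test):
    """u_t: prediction of `test` from each underlying state."""
    return [predict([1.0 if s == s0 else 0.0 for s in range(N)], test)
            for s0 in range(N)]


# U[s][j] = p(q_j | state s); invertible for this choice of core tests.
U = [[state_column(q)[s] for q in CORE_TESTS] for s in range(N)]
DET = U[0][0] * U[1][1] - U[0][1] * U[1][0]


def weights_for(test):
    """m_t solving U m_t = u_t, so that p(t | h) = p(Q | h) m_t."""
    u = state_column(test)
    return [(U[1][1] * u[0] - U[0][1] * u[1]) / DET,
            (U[0][0] * u[1] - U[1][0] * u[0]) / DET]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def psr_update(p_q, a, o):
    """p(Q | hao) = p(Q | h) M_aoQ / (p(Q | h) m_ao)."""
    numer = [dot(p_q, weights_for([(a, o)] + q)) for q in CORE_TESTS]  # p(ao q_j | h)
    denom = dot(p_q, weights_for([(a, o)]))                            # p(ao | h)
    return [x / denom for x in numer]


belief = [0.25, 0.75]                                # assumed belief for some history h
p_q = [predict(belief, q) for q in CORE_TESTS]       # PSR state p(Q | h)
updated = psr_update(p_q, 0, 1)                      # PSR state after taking a0, seeing o1

# Reference answer from an explicit belief update.
w = step(belief, 0, 1)
new_belief = [x / sum(w) for x in w]
expected = [predict(new_belief, q) for q in CORE_TESTS]
print(updated, expected)
```

The PSR update never touches the underlying states: it only uses the current core-test predictions and the learned weights, yet it matches the belief-state computation.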

  17. Extracting PSRs from Data

  20. What Data? $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$. How are actions chosen? Unknown policy. Known policy. Controlled policy. Note: existing algorithms require a particular control policy, either exhaustively trying history-test pairs or taking random actions.

  24. Extracting PSRs from Data (James & Singh, 2004) (Rosencrantz et al., 2004) (Wolfe et al., 2005) (Wiewiora, 2005) (McCracken & Bowling, 2006). The common formula: find core tests; find update parameters. Estimate part of the system dynamics matrix, i.e. a subset of predictions $\hat{p}(t \mid h)$: $\hat{p}_\bullet(t \mid h) = \frac{\# h a_1 o_1 \ldots a_n o_n}{\# h a_1 \ldots a_n}$, where $\# x$ counts occurrences of the sequence $x$ in the data.

  25. Problem. $E[\hat{p}_\bullet(t \mid h)] = p(t \mid h)\, \frac{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 o_1 \ldots a_{i-1} o_{i-1})}{\prod_{i=1}^{n} \Pr(a_i \mid h a_1 \ldots a_{i-1})}$. Definition: a policy is blind if actions are selected independently of preceding observations, i.e., $\Pr(a_n \mid a_1, o_1, \ldots, a_{n-1}, o_{n-1}) = \Pr(a_n \mid a_1, \ldots, a_{n-1})$. Observation: $\hat{p}_\bullet(t \mid h)$ is an unbiased estimator of $p(t \mid h)$ only if $\pi$ is blind.
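
A small exact calculation makes the bias concrete. The sketch below assumes a hypothetical memoryless system (the observation depends only on the current action) and two hypothetical policies for the second action. It computes the ratio of expected counts, standing in for the large-sample value of the count-ratio estimator from the empty history. Everything named here is an illustrative assumption.

```python
OBS_ONE = {0: 0.3, 1: 0.8}  # assumed Pr(o = 1 | a); observations are memoryless


def obs_prob(a, o):
    return OBS_ONE[a] if o == 1 else 1.0 - OBS_ONE[a]


def blind(a2, o1):
    """Second action ignores the first observation."""
    return 0.5


def non_blind(a2, o1):
    """Second action copies the first observation with probability 0.9."""
    return 0.9 if a2 == o1 else 0.1


def count_ratio_limit(policy, test):
    """Ratio of expected counts #(a1 o1 a2 o2) / #(a1 a2) from the empty history."""
    a1, o1, a2, o2 = test
    first_action = 0.5  # the first action is uniform under both policies
    joint = first_action * obs_prob(a1, o1) * policy(a2, o1) * obs_prob(a2, o2)
    actions_only = sum(first_action * obs_prob(a1, o) * policy(a2, o) for o in (0, 1))
    return joint / actions_only


test = (0, 1, 1, 1)                                  # a1=0, o1=1, a2=1, o2=1
true_prediction = obs_prob(0, 1) * obs_prob(1, 1)    # p(t | h) = 0.3 * 0.8
blind_limit = count_ratio_limit(blind, test)
non_blind_limit = count_ratio_limit(non_blind, test)
print(true_prediction, blind_limit, non_blind_limit)
```

Under the blind policy the ratio equals the true prediction 0.24. Under the non-blind policy it equals 0.24 × 0.9 / 0.34, roughly 0.635, which is exactly the policy-ratio factor in the expectation formula on this slide.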

  26. What Data? $a_1, o_1, a_2, o_2, \ldots, a_n, o_n$. How are actions chosen? Unknown policy. Known policy. Controlled policy.

  27. Prediction Estimators.
      Policy is known: $\hat{p}_\pi(t \mid h) = \frac{\# ht}{\# h} \cdot \frac{1}{\pi(t \mid h)}$.
      Policy is not known: $\hat{p}_{\pi\times}(t \mid h) = \prod_{i=1}^{n} \frac{\# h a_1 o_1 \ldots a_i o_i}{\# h a_1 o_1 \ldots a_i}$.
      Theorem: $\hat{p}_\pi(t \mid h)$ and $\hat{p}_{\pi\times}(t \mid h)$ are unbiased estimators of $p(t \mid h)$.
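
Both estimators on this slide are simple functions of counts. The sketch below assumes an idealized data set in which every two-step trajectory appears exactly with its expected count under a hypothetical memoryless system and a hypothetical non-blind policy. It is not sampled data and does not prove unbiasedness; it only shows the arithmetic. Function names are illustrative.

```python
from collections import Counter
from itertools import product

OBS_ONE = {0: 0.3, 1: 0.8}   # assumed Pr(o = 1 | a); observations are memoryless
N_TRAJ = 2000                # assumed number of trajectories, all starting from a reset


def obs_prob(a, o):
    return OBS_ONE[a] if o == 1 else 1.0 - OBS_ONE[a]


def second_action_prob(a2, o1):
    """Non-blind policy: the second action copies the first observation with prob. 0.9."""
    return 0.9 if a2 == o1 else 0.1


# Idealized data set: each trajectory (a1, o1, a2, o2) appears with its expected count.
counts = Counter()
for a1, o1, a2, o2 in product((0, 1), repeat=4):
    prob = 0.5 * obs_prob(a1, o1) * second_action_prob(a2, o1) * obs_prob(a2, o2)
    counts[(a1, o1, a2, o2)] = round(N_TRAJ * prob)


def count(prefix):
    """Number of trajectories that start with `prefix` (h is the empty history)."""
    return sum(c for traj, c in counts.items() if traj[:len(prefix)] == prefix)


def estimate_known_policy(test, policy_prob):
    """(#ht / #h) / pi(t | h), with pi(t | h) supplied by the caller."""
    return count(test) / count(()) / policy_prob


def estimate_unknown_policy(test):
    """Product over steps i of #(a1 o1 ... ai oi) / #(a1 o1 ... ai)."""
    estimate = 1.0
    for i in range(2, len(test) + 1, 2):
        estimate *= count(test[:i]) / count(test[:i - 1])
    return estimate


test = (0, 1, 1, 1)                                   # a1=0, o1=1, a2=1, o2=1
true_prediction = obs_prob(0, 1) * obs_prob(1, 1)     # 0.3 * 0.8
known = estimate_known_policy(test, 0.5 * second_action_prob(1, 1))
unknown = estimate_unknown_policy(test)
print(true_prediction, known, unknown)
```

On these idealized counts both estimators return the true prediction 0.24 even though the policy is non-blind. The known-policy form divides out $\pi(t \mid h)$ explicitly, while the product form conditions each ratio on the full preceding action-observation prefix, so the policy cancels step by step.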

  28. Exploration. Goal: choose actions to reduce error in the estimated system dynamics matrix. Approach: add intelligent exploration to James & Singh's "reset" algorithm. Since $\hat{p}_\pi(t \mid h)$ is an unbiased estimator, we want to take actions that reduce its variance. Solve as an optimization problem.

  29. Estimator Variance.
      $V[\hat{p}_\pi(t \mid h) \mid \# h = n] = \frac{p(t \mid h)}{n\, \pi(t \mid h)} - \frac{p(t \mid h)^2}{n} \le \frac{1}{4 n\, \pi(t \mid h)^2}$
      $E[\, V[\hat{p}_\pi(t \mid h) \mid \# h = n] \mid k \text{ trajectories} \,] \le \frac{1}{4 k\, p(h)\, \pi(h)\, \pi(t \mid h)^2}$
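
The conditional variance can be sanity-checked by brute force. The sketch below assumes that, given $\# h = n$, the count $\# ht$ is Binomial with $n$ trials and success probability $p(t \mid h)\, \pi(t \mid h)$, which holds when visits to $h$ are independent. The numeric values of `p`, `pi`, and `n` are arbitrary illustrations.

```python
from math import comb


def exact_variance(p, pi, n):
    """Variance of (#ht / n) / pi when #ht ~ Binomial(n, p * pi)."""
    r = p * pi  # Pr(t | h): chance that a visit to h is followed by the test t
    mean = 0.0
    second_moment = 0.0
    for k in range(n + 1):
        weight = comb(n, k) * r**k * (1 - r)**(n - k)
        value = k / (n * pi)          # estimator value when #ht = k
        mean += weight * value
        second_moment += weight * value**2
    return second_moment - mean**2


def closed_form(p, pi, n):
    """p / (n pi) - p^2 / n, the closed-form variance."""
    return p / (n * pi) - p**2 / n


def upper_bound(pi, n):
    """1 / (4 n pi^2), the bound that no longer depends on p."""
    return 1.0 / (4 * n * pi**2)


p, pi, n = 0.24, 0.45, 20             # assumed illustrative values
exact = exact_variance(p, pi, n)
formula = closed_form(p, pi, n)
bound = upper_bound(pi, n)
print(exact, formula, bound)
```

The enumeration and the closed form agree. The bound follows from $p\pi(1 - p\pi) \le 1/4$, which is why it depends only on the policy probability and the visit count, the two quantities that exploration can influence.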

  30. Exploration. Intuition: find the policy that maximizes the worst-case (over all predictions) bound on the root expected inverse variance.
      Optimization problem. Maximize: $\min_{h,t} \left( \sqrt{v_{i-1}(h,t)^{-1}} + 2 \sqrt{k_i\, p(h)}\, \pi(ht) \right)$
      Subject to the sequence form constraints on $\pi(ht)$:
      1. $\pi(\phi) = 1$,
      2. $\forall h, o \in O$: $\pi(h) = \sum_a \pi(hao)$, and
      3. $\forall h, a \in A, \{o, o'\} \subseteq O$: $\pi(hao) = \pi(hao')$.

  31. Results. [Figure: three panels (Tiger, Paint, Float-reset), each plotting testing error on a log scale against sample size for non-blind and random exploration.]

  32. Summary. Contributions: unbiased prediction estimators for non-blind policies; variance analysis in the case of a known policy; estimators used in "intelligent" exploration, which was shown to speed learning. Future work: better objective functions for exploration; investigate when non-blind exploration proves helpful. Questions?
