(Learning to) Learn to Control
Jan Křetínský, Technical University of Munich, Germany
Joint work with P. Ashok, T. Meggendorfer (TUM), T. Brázdil (Masaryk University Brno), K. Chatterjee, M. Chmelík, P. Daca, A. Fellner, T. Henzinger, T. Petrov, V. Toman (IST Austria), V. Forejt, M. Kwiatkowska, M. Ujma (Oxford University), D. Parker (University of Birmingham)
Dagstuhl seminar: Computer-Assisted Engineering for Robotics and Autonomous Systems, February 14, 2017
Controller synthesis and verification
Formal methods and machine learning

Formal methods:
+ precise
– scalability issues (MEM-OUT)

Learning:
+ scalable
+ simpler solutions
– weaker guarantees
– can be hard to use with different objectives

Goal of the combination: precise computation that focuses on the important parts.
Examples

◮ Reinforcement learning for efficient controller synthesis
  ◮ MDP with functional spec (reachability, LTL) [1]
  ◮ MDP with performance spec (mean payoff / average reward) [2]
◮ Decision tree learning for efficient controller representation
  ◮ MDP [3]
  ◮ Games [4]

[1] Brázdil, Chatterjee, Chmelík, Forejt, K., Kwiatkowska, Parker, Ujma: Verification of Markov Decision Processes Using Learning Algorithms. ATVA 2014
    Daca, Henzinger, K., Petrov: Faster Statistical Model Checking for Unbounded Temporal Properties. TACAS 2016
[2] Ashok, Chatterjee, Daca, K., Meggendorfer: Value Iteration for Long-run Average Reward in Markov Decision Processes. Submitted
[3] Brázdil, Chatterjee, Chmelík, Fellner, K.: Counterexample Explanation by Learning Small Strategies in Markov Decision Processes. CAV 2015
[4] Brázdil, Chatterjee, K., Toman: Strategy Representation by Decision Trees in Reactive Synthesis. Submitted
Example: Markov decision processes

[Figure: a small MDP with initial state s_init, a target state goal, actions including up, down, a, b, c, and transition probabilities such as 0.5, 0.01, and 0.99; one slide overlay shows the choice in s_init as a decision-tree node "ACTION = down" with branches Y/N.]

Task: find a controller σ achieving max_σ P^σ[◊ goal].
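To make the objective concrete, here is a minimal value-iteration sketch in Python for maximal reachability; the toy transition structure is an illustrative assumption, only loosely inspired by the slide's picture, not the exact model.

```python
# MDP encoding: state -> action -> list of (successor, probability).
# The concrete states, actions and probabilities are illustrative assumptions.
mdp = {
    "s_init": {"up":   [("u", 1.0)],
               "down": [("t", 0.5), ("goal", 0.5)]},
    "u":      {"a":    [("s_init", 1.0)]},
    "t":      {"c":    [("goal", 0.01), ("t", 0.99)]},
    "goal":   {},  # absorbing target state
}

def value_iteration(mdp, target, eps=1e-6):
    # v[s] approximates max_sigma P^sigma[<> target] from state s (from below).
    v = {s: (1.0 if s == target else 0.0) for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if s == target or not actions:
                continue
            # Bellman update: the best action maximizes the expected successor value.
            new = max(sum(p * v[s2] for s2, p in succ) for succ in actions.values())
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < eps:
            return v

print(value_iteration(mdp, "goal")["s_init"])  # close to 1.0 in this toy model
```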
Example 1: Computing controllers faster

Idea: more frequently update what is visited more frequently by reasonably good controllers.

repeat
    sample a path from s_init        ▷ pick action arg max_a UpBound(s −a→)
    for all visited transitions s −a→ do
        Update(s −a→)
until UpBound(s_init) − LoBound(s_init) < ε

Effect: faster and still with sure (guaranteed) bounds, because the updates concentrate on the important parts of the system.
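The following Python sketch mirrors this sampling loop in the style of BRTDP, reusing the toy `mdp` dictionary from the earlier value-iteration sketch. It is only an illustration of the idea: the full algorithm (Brázdil et al., ATVA 2014) additionally handles end components, which is needed for the upper bound to converge on general MDPs and is omitted here; updates are applied per visited state rather than per transition.

```python
import random

def brtdp(mdp, s_init, target, eps=1e-3, max_path_len=200):
    # Lower and upper bounds on max_sigma P^sigma[<> target] for every state.
    lo = {s: (1.0 if s == target else 0.0) for s in mdp}
    up = {s: 1.0 for s in mdp}

    def pick_action(s):
        # Greedy in the upper bound (ties broken randomly): follow what a
        # "reasonably good" controller still considers promising.
        vals = {a: sum(p * up[s2] for s2, p in succ) for a, succ in mdp[s].items()}
        best = max(vals.values())
        return random.choice([a for a, v in vals.items() if v >= best - 1e-12])

    def update(s):
        # Bellman backup of both bounds over all actions of a visited state.
        lo[s] = max(sum(p * lo[s2] for s2, p in succ) for succ in mdp[s].values())
        up[s] = max(sum(p * up[s2] for s2, p in succ) for succ in mdp[s].values())

    while up[s_init] - lo[s_init] > eps:
        # Sample a path from s_init, picking actions with maximal upper bound.
        path, s = [], s_init
        while mdp[s] and len(path) < max_path_len:
            a = pick_action(s)
            path.append(s)
            succs, probs = zip(*mdp[s][a])
            s = random.choices(succs, probs)[0]
        # Update the visited states along the path, last one first.
        for s in reversed(path):
            update(s)
    return lo[s_init], up[s_init]

print(brtdp(mdp, "s_init", "goal"))  # two bounds at most eps apart
```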
Example 1: Experimental results

Visited states:

Example      PRISM         with RL
zeroconf      4,427,159        977
wlan          5,007,548      1,995
firewire     19,213,802     32,214
mer          26,583,064      1,950
Example 2: Computing small controllers

Controller representations:
◮ explicit map σ : S → A
◮ BDD (binary decision diagram) encoding its bit representation
◮ DT (decision tree)

Example     #states      Value       Explicit    BDD    DT    Rel.err(DT) %
firewire      481,136    1.0          479,834   4233     1    0.0
investor       35,893    0.958         28,151    783    27    0.886
mer         1,773,664    0.200016    ——— MEM-OUT ——— *
zeroconf       89,586    0.00863       60,463    409     7    0.106

* MEM-OUT in PRISM; with RL instead: Explicit 1887, BDD 619, DT 13, Rel.err(DT) 0.00014
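As a rough illustration of the decision-tree option, the sketch below fits an off-the-shelf classifier to an explicit strategy. The state features, the tiny strategy, and the use of scikit-learn are assumptions made for the example; the cited work uses its own DT-learning setup.

```python
# Sketch: compress an explicit memoryless strategy sigma: S -> A into a
# decision tree over the state variables. Features and strategy are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical states described by two integer variables (x, y), together
# with the action sigma(s) chosen by the controller in each of them.
states  = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
actions = ["down", "down", "down", "up", "up", "up"]

tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
print(export_text(tree, feature_names=["x", "y"]))  # human-readable predicates
```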
Example 2: Computing small controllers

From precise decisions to a DT: weight the decisions by their importance.

Importance of a decision in state s with respect to ◊ goal and controller σ:

    P^σ[◊ s | ◊ goal]
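A minimal sketch of how this importance could be estimated by simulation under a fixed controller; representing the controller as a dictionary `sigma` and the plain Monte Carlo estimate are assumptions for illustration, not the exact procedure of the cited papers.

```python
import random

def estimate_importance(mdp, sigma, s_init, target, runs=10_000, max_len=200):
    # Monte Carlo estimate of P^sigma[<> s | <> goal] for every state s:
    # the fraction of goal-reaching runs under sigma that also visit s.
    visits = {s: 0 for s in mdp}
    goal_runs = 0
    for _ in range(runs):
        s, seen = s_init, set()
        for _ in range(max_len):
            seen.add(s)
            if s == target or not mdp[s]:
                break
            succs, probs = zip(*mdp[s][sigma[s]])
            s = random.choices(succs, probs)[0]
        if target in seen:
            goal_runs += 1
            for v in seen:
                visits[v] += 1
    return {s: (visits[s] / goal_runs if goal_runs else 0.0) for s in mdp}
```

States with importance near zero can then be dropped or down-weighted when the tree is learned, which is what allows the very small trees reported in the table above.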
Some related work

Further examples on decision trees
◮ Garg, Neider, Madhusudan, Roth: Learning Invariants using Decision Trees and Implication Counterexamples. POPL 2016
◮ Krishna, Puhrsch, Wies: Learning Invariants Using Decision Trees.

Further examples on reinforcement learning
◮ Junges, Jansen, Dehnert, Topcu, Katoen: Safety-Constrained Reinforcement Learning for MDPs. TACAS 2016
◮ David, Jensen, Larsen, Legay, Lime, Sørensen, Taankvist: On Time with Minimal Expected Cost! ATVA 2014
Summary

Machine learning in verification
◮ Scalable heuristics
◮ Example 1: Speeding up value iteration
  ◮ technique: reinforcement learning, BRTDP
  ◮ idea: focus the updates on the "most important parts" = those most often visited by good strategies
◮ Example 2: Small and readable strategies
  ◮ technique: decision tree learning
  ◮ idea: based on the importance of states, feed the decisions to the learning algorithm
◮ Learning in Verification (LiVe) at ETAPS
◮ Explainable Verification (FEVer) at CAV
Discussion

Verification using machine learning
◮ How far do we want to compromise?
◮ Do we have to compromise?
  ◮ BRTDP, invariant generation, and strategy representation don't
◮ Don't we want more than ML?
  ◮ (ε-)optimal controllers?
  ◮ arbitrary controllers – is it still verification?
◮ What do we actually want?
  ◮ scalability shouldn't overrule guarantees?
  ◮ when is PAC enough?
  ◮ oracle usage seems fine
◮ How much of it can work for examples from robotics?