

  1. Game-Theoretic Learning for Verification and Control. Sanjit A. Seshia, Professor, EECS, UC Berkeley. Joint work with Dorsa Sadigh, Jon Kotker, Daniel Bundala, Anca Dragan, Alexander Rakhlin, S. Shankar Sastry. Dagstuhl Seminar, March 16, 2017.

  2. Two Stories: 1 Control, 1 Verification. Control: Human Cyber-Physical Systems (e.g., autonomous/semi-autonomous driving) – Learning (Synthesizing) Models of Human Behavior. Verification: Timing Analysis of Embedded Software – Learning (Synthesizing) a Model of the Platform (how the platform impacts a program's timing behavior).

  3. Challenge: Interactions with Humans and Human-Controlled Systems outside the Vehicle. "One of the biggest challenges facing automated cars is blending them into a world in which humans don't behave by the book."

  4. How can we make an autonomous vehicle behave/communicate "naturally" with (possibly adversarial) humans in its environment?

  5. Interaction-Aware Control. • D. Sadigh, S. Sastry, S. Seshia, A. Dragan. Information Gathering Actions over Internal Human State. In IROS, 2016. • D. Sadigh, S. Sastry, S. Seshia, A. Dragan. Planning for Autonomous Cars that Leverages Effects on Human Actions. In RSS, 2016.

  6. Interaction as a Dynamical System. The robot has direct control over its own actions $u_R$ and only indirect control over the human's actions $u_H$. Model the problem as a Stackelberg game: the robot moves first.

  7. Assumptions/Simplifications. Model Predictive (Receding Horizon) Control: plan for a short time horizon $N$, replan at every step $t$. Assume a deterministic "rational" human model: the human optimizes a reward function that is a linear combination of "features". The human has full access to the robot's actions $u_R$ for the short time horizon, so the human response can be written as $u_H^*(x^0, u_R)$.
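  A minimal Python sketch of the receding-horizon loop described on this slide. The names run_mpc, plan_horizon, and dynamics are hypothetical placeholders for illustration, not the authors' implementation:

      # Receding-horizon (MPC) loop: plan N actions, apply only the first,
      # then replan from the resulting state. All names are hypothetical.
      def run_mpc(x0, steps, N, plan_horizon, dynamics):
          x = x0
          for t in range(steps):
              u_R = plan_horizon(x, N)   # plan N robot actions from state x
              x = dynamics(x, u_R[0])    # apply only the first action, replan
          return x

      # Toy usage: scalar state driven toward 0 by a proportional "planner".
      print(run_mpc(5.0, steps=10, N=3,
                    plan_horizon=lambda x, N: [-0.5 * x] * N,
                    dynamics=lambda x, u: x + u))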

  8. Interaction as a Dynamical System. Find optimal actions for the autonomous vehicle while accounting for the human response $u_H^*$; model the human response as optimizing the human reward function $R_H$:
  $u_R^* = \arg\max_{u_R} R_R(x^0, u_R, u_H^*(x^0, u_R))$, where
  $u_H^*(x^0, u_R) = \arg\max_{u_H} R_H(x^0, u_R, u_H)$.

  9. Learning (Human) Driver Models. Learn the human's reward function via Inverse Reinforcement Learning [Ziebart et al., AAAI'08; Levine & Koltun, 2012]. Assume a structure for the human reward function:
  $R_H(x^0, u_R, u_H) = \sum_t r_H(x^t, u_R^t, u_H^t)$, with $r_H(x, u_R, u_H) = w^\top \phi(x, u_R, u_H)$.
  Features: (a) the boundaries of the road, (b) staying inside the lanes, (c) avoiding other vehicles.
  B. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
  S. Levine and V. Koltun. Continuous inverse optimal control with locally optimal examples. arXiv, 2012.
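  As an illustration of the linear-in-features reward above, here is a small Python sketch. The three features are invented stand-ins for the ones named on the slide (road boundaries, lane keeping, collision avoidance), not the exact features from the paper:

      import numpy as np

      # Sketch of the linear-in-features reward r_H = w^T phi(x).
      def features(x):
          """x = (y_human, y_lane_center, dist_to_other_car); all hypothetical."""
          y, y_lane, d = x
          return np.array([
              -np.exp(-(3.0 - abs(y))),   # penalty near the road boundaries
              -(y - y_lane) ** 2,         # stay close to the lane center
              -np.exp(-d ** 2),           # avoid proximity to other vehicles
          ])

      def reward(w, x):
          return w @ features(x)

      w = np.array([1.0, 0.5, 2.0])       # weights would be learned via IRL
      print(reward(w, (0.2, 0.0, 5.0)))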

  10. Solution of Nested Optimization. Solve
  $u_R^* = \arg\max_{u_R} R_R(x^0, u_R, u_H^*(x^0, u_R))$, where $u_H^*(x^0, u_R) = \arg\max_{u_H} R_H(x^0, u_R, u_H)$,
  using a gradient-based quasi-Newton method (the L-BFGS technique), differentiating through the human's best response $u_H^*$.
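  A toy Python sketch of this nested optimization, using SciPy's L-BFGS implementation. The quadratic rewards are invented placeholders, and the sketch simply re-solves the inner problem inside the outer one with numerical gradients; the actual approach differentiates the human's best response analytically:

      import numpy as np
      from scipy.optimize import minimize

      def R_H(u_R, u_H):
          # Toy human reward: the human wants to track half of u_R.
          return -(u_H - 0.5 * u_R) ** 2

      def human_response(u_R):
          # Inner argmax: the human's best response to the robot's action.
          res = minimize(lambda u: -R_H(u_R, u[0]), x0=[0.0], method="L-BFGS-B")
          return res.x[0]

      def R_R(u_R):
          # Toy robot reward, evaluated at the human's best response.
          u_H = human_response(u_R[0])
          return -((u_R[0] - 1.0) ** 2 + (u_H - 1.0) ** 2)

      res = minimize(lambda u: -R_R(u), x0=[0.0], method="L-BFGS-B")
      print("u_R* =", res.x[0], "u_H* =", human_response(res.x[0]))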

  11. Implication: Efficiency. [Figure: robot and human vehicle trajectories.]

  12. Implication: Efficiency

  13. Implication: Efficiency

  14. Implication: Coordination

  15. Implication: Coordination

  16. [Plot: y-position of the human vehicle vs. x-position of the autonomous vehicle, for the case where the human crosses first and the case where the human crosses second.]

  17. Summary. • Model the control problem as a Stackelberg game. • Data-driven approach to learning a model of the human as a rational agent maximizing their reward function. – Next steps: a more realistic human model ("bounded rational" model). • Combine with a receding-horizon control approach to obtain an interaction-aware controller. – Next steps: combine with previous work on correct-by-construction control with temporal logic specifications. • Temporal logic is compiled into constraints – need to improve constrained optimization methods!

  18. Two Stories: 1 Verification, 1 Control. Control: Human Cyber-Physical Systems (e.g., autonomous/semi-autonomous driving) – Learning (Synthesizing) Models of Human Behavior. Verification: Timing Analysis of Embedded Software – Learning (Synthesizing) a Model of the Platform (how the platform impacts a program's timing behavior).

  19. Game-Theoretic Timing Analysis. • S. A. Seshia and A. Rakhlin. Game-Theoretic Timing Analysis. In ICCAD, 2008. • S. A. Seshia and A. Rakhlin. Quantitative Analysis of Systems Using Game-Theoretic Learning. ACM Transactions on Embedded Computing Systems, 2012.

  20. Challenge in Timing Analysis. Does the brake-by-wire software always actuate the brakes within 1 ms? NASA's Toyota UA report (2011) notes: "In practice…there are significant limitations" (in the state of the art in timing analysis). CHALLENGE: ENVIRONMENT MODELING. We need a good model of the platform (processor, memory hierarchy, network, I/O devices, etc.).

  21. Complexity of a Timing Model: Path Space × Platform State Space. The timing of an edge (basic block) depends on: • the path it lies on, • the initial platform state. Challenges: • exponential number of paths and platform states! • lack of visibility into the platform state. [Figure: an example program CFG unrolled to a DAG, on a processor with a data cache, with branches on flag != 0 and statements flag = 1, (*x)++, and *x += 2.]

  22. Example: Automotive Window Controller. ~1000 lines of C code; ~10^16 paths.

  23. Our Approach and Contributions [S. A. Seshia & A. Rakhlin, ICCAD '08, ACM TECS]. Model the estimation problem as a game: Tool vs. Platform. • Measurement-based, but with minimal instrumentation – perform end-to-end measurements of selected (linearly many) paths on the platform. • Learn an environment model – similar to online shortest path in the "bandit" setting. • Online, randomized algorithm: GameTime – theoretical guarantee: can predict worst-case timing with arbitrarily high probability under the model assumptions. • Uses satisfiability modulo theories (SMT) solvers for test generation.

  24. The Game Formulation. Complexity = Path Space (controllable) × Platform State Space (uncontrollable). Model as a 2-player game: Tool vs. Platform. – The Tool selects program paths. – The Platform 'selects' its state (possibly adversarially). Questions: – What is a good platform model? – How do we select paths so that we can learn an accurate platform model from executing them?

  25. Platform Model. The Platform selects weights for the edges of the CFG. The timing of an edge of the unrolled CFG is a nominal weight $w$ (modeling path-independent timing) plus a path-specific perturbation $\pi$ (modeling path-dependent timing).

  26. A Path is a Vector $x \in \{0,1\}^m$ ($m$ = #edges). Insight: we only need to sample a basis of the space of paths. [Figure: a path drawn as a 0-1 vector over the CFG edges.]

  27. Basis Paths. #(basis paths) ≤ $m$; fewer than 200 basis paths for the automotive window controller. It is useful to compute certain special bases called "barycentric spanners" (see the sketch below).
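  A Python sketch of the basis-extraction idea: greedily keep path vectors that increase the rank of the measurement matrix. This finds some basis of the path space; computing the special barycentric-spanner basis that GameTime uses takes more work and is not shown:

      import numpy as np

      # Greedily keep paths that increase matrix rank -- this finds *a* basis
      # of the span of the 0-1 path vectors, not a barycentric spanner.
      def extract_basis(paths):
          basis = []
          for x in paths:
              candidate = np.array(basis + [x], dtype=float)
              if np.linalg.matrix_rank(candidate) == len(basis) + 1:
                  basis.append(x)
          return basis

      # Toy CFG: two if-then-else blocks in sequence give m = 4 edges and
      # four paths, only three of which are linearly independent.
      paths = [[1, 0, 1, 0],
               [1, 0, 0, 1],
               [0, 1, 1, 0],
               [0, 1, 0, 1]]   # = path2 + path3 - path1, so dependent
      print(extract_basis(paths))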

  28. Timing Analysis Game (Our Model). Played over several rounds $t = 1, 2, 3, \ldots, \tau$. At each round $t$: the Tool picks a path $x_t$; the Platform picks edge weights $w_t$ and a path-specific perturbation $\pi_t(x_t)$; the Tool observes the path length $\ell_t = x_t \cdot (w_t + \pi_t)$. [Figure: a CFG whose edges have weights 5, 7, 1, 11 and perturbations $(-1, -1, -1, -1)$, giving observed length $(5+7+1+11) - 4 = 20$.] At round $\tau$, the Tool predicts the longest path $x^*_\tau$; the Tool wins iff its prediction is correct.
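  The core estimation step can be sketched in Python as follows: measure end-to-end lengths $\ell_t = x_t \cdot (w_t + \pi_t)$ for basis paths only, fit nominal edge weights by least squares, and predict the lengths of unmeasured paths. The weights and noise below are invented for illustration; the actual GameTime algorithm is randomized and carries adversarial (bandit-style) guarantees:

      import numpy as np

      rng = np.random.default_rng(0)
      w_true = np.array([5.0, 7.0, 1.0, 11.0])       # hidden nominal edge weights

      basis = np.array([[1, 0, 1, 0],
                        [1, 0, 0, 1],
                        [0, 1, 1, 0]], dtype=float)  # measured basis paths

      # Observed lengths: true length plus a small path-specific perturbation.
      lengths = basis @ w_true + rng.normal(0.0, 0.5, size=3)

      # Least-squares estimate of the edge weights from basis measurements.
      w_hat, *_ = np.linalg.lstsq(basis, lengths, rcond=None)

      # Predict the length of an unmeasured path in the span of the basis.
      x_new = np.array([0, 1, 0, 1], dtype=float)
      print("predicted:", x_new @ w_hat, "true:", x_new @ w_true)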

  29. Theorem about Estimating the Distribution (pictorial view). Mean Perturbation Assumption: $\forall x \in \text{Paths}$, $|\mathbb{E}[x \cdot \pi_t]| \leq \mu_{\max}$. Under this assumption, the error in the predicted execution times is $O(b\,\mu_{\max})$.

  30. Some Experimental Results (details in the ICCAD '08, ACM TECS, and FMCAD '11 papers). • GameTime is efficient – e.g., 7 × 10^16 total paths vs. < 200 basis paths. • Accurately predicts WCET for complex platforms – instruction & data caches, pipeline, branch prediction, … • Basis paths effectively encode information about the timing of other paths – found paths 25% longer than the sampled basis. • GameTime can accurately estimate the distribution of execution times with few measurements – measure basis paths, predict other paths.

  31. Discussion: Qualitative Characterization of the Problems Described. The two problems sit on two axes: adversarial vs. cooperative, and full information vs. no information. – Verification/Analysis (timing): adversarial; an almost black-box $(w + \pi)$ platform model, with the platform constrained only by the assumptions on $w$ and $\pi$. – Control/Synthesis (driving): we know only the structure of the human reward function beforehand, but observe the entire system state; the human can behave arbitrarily, albeit only as a rational agent, not actively violating the robot's objective.
