Slide 1
Learning to Plan with Logical Automata
Brandon Araki 1*, Kiran Vodrahalli 2*, Thomas Leech 1,3, Mark Donahue 3, Cristian-Ioan Vasile 1, Daniela Rus 1
1 Massachusetts Institute of Technology  2 Columbia University  3 MIT Lincoln Laboratory
*Equal contributors

Slide 2
Many environments have simple rules – for example cooking from a recipe, playing games, driving, and assembly. People are able to learn how to perform tasks like these by observing an expert. When observing an expert, people don’t learn to just mimic the expert. They learn the rules that the expert is following. This allows a person who has, for example, learned to cook a dish to modify the ingredients they put in the dish or the order in which they add them.

Slide 3 – Goals
Our goal is to replicate this ability algorithmically using model-based imitation learning: learn to plan in an environment with rules.
1. Learn the rules in a way that they can be easily interpreted by humans.
2. Incorporate the rules into planning so that modifying the rules results in predictable changes in behavior.
Slide 4 – Packing a Lunchbox
Rule: pack a burger or a sandwich; then pack a banana.
Let’s say you have a robot that has to pack a lunchbox. The rules are that it has to first pack a burger or a sandwich, and then pack the banana.

Slide 5 – Goal 1 – Interpretability
Rules: pack a burger or a sandwich; then pack a banana.
[Figure: finite state automaton. Initial State → Picked up (burger or sandwich) → Packed → Picked up (banana) → Packed → GOAL!]
We can make these rules both useful and interpretable by representing them as a finite state automaton (FSA). We assume that the environment can be factored into a high-level Markov Decision Process, which is equivalent to the FSA, and a low-level MDP of the sort usually used in reinforcement learning.

Slide 6 – Factoring the Environment
High-level MDP: pack sandwich or burger; then pack banana; avoid obstacles.
Low-level MDP: the underlying reinforcement learning environment.
So you can imagine that we have a reinforcement learning robot arm simulator with state x, y, theta, etc., and actions such as torques or commanded positions. We assume that there is also a high-level MDP, which embodies the rules that the robot must follow.
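To make the factoring concrete, here is a minimal sketch (not the authors’ code) of how the two levels combine: the agent’s full state pairs a low-level gridworld state with a high-level FSA state, and propositions observed in the low level drive the FSA. All state names, proposition names, and the transition table below are illustrative assumptions based on the lunchbox example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactoredState:
    grid_pos: tuple   # low-level MDP state, e.g. an (x, y) cell in the gridworld
    fsa_state: str    # high-level MDP state, e.g. "S0", "S1", ...

def step_fsa(fsa_state, proposition, transitions):
    """Advance the FSA when a proposition (e.g. 'burger') becomes true.
    `transitions` maps (state, proposition) -> next state; missing pairs
    leave the FSA state unchanged (a self-loop)."""
    return transitions.get((fsa_state, proposition), fsa_state)

# Hypothetical encoding of the lunchbox rules:
# pack a burger or a sandwich, then pack a banana.
transitions = {
    ("S0", "burger"): "S1", ("S0", "sandwich"): "S1",
    ("S1", "packed"): "S2",
    ("S2", "banana"): "S3",
    ("S3", "packed"): "G",  # G: goal reached
}

s = FactoredState(grid_pos=(0, 0), fsa_state="S0")
s = FactoredState(s.grid_pos, step_fsa(s.fsa_state, "sandwich", transitions))
assert s.fsa_state == "S1"  # picking up the sandwich advances the FSA
```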
Slide 7 – Representing the Environment
Finite state automaton: pack sandwich or burger; then pack banana; avoid obstacles.
[Figure: the FSA (Initial State → Picked up (sandwich or burger) → Packed → Picked up (banana) → Packed) alongside a discrete 2D gridworld.]

Slide 8 – Goal 2 – Manipulability
Incorporate the FSA into planning.
[Figure: the FSA with its states labeled S0, S1, S2, S3, G, and T, next to the transition matrix of state S0. Columns are propositions, including the obstacle o and the empty proposition Ø; rows are the FSA states S0, S1, S2, S3, G, T.]
We want to be able to modify the behavior of the agent in order to make it perform tasks that are similar to, but different from, the one it has learned. We achieve this by incorporating the FSA into a recursive planning step. Since FSAs are graphs, they can be converted into a transition matrix. First it is useful to label each FSA state with a name. Here is the transition matrix of the first FSA state: the columns are associated with features of the environment, and the rows correspond to FSA states. You can see by looking at the graph that the sandwich and the burger cause a transition to state S1, whereas the other items do not cause a transition to a new state.
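As a hedged reconstruction of the slide’s figure, the matrix for state S0 might be written in numpy as follows. The proposition set and its ordering are assumptions (the slide only shows o and Ø explicitly); each column is the distribution over next FSA states when that proposition holds.

```python
import numpy as np

props = ["sandwich", "burger", "banana", "o", "Ø"]   # columns: propositions
states = ["S0", "S1", "S2", "S3", "G", "T"]          # rows: FSA states

# T_S0[i, j] = probability of moving to FSA state i from S0 when
# proposition j holds: sandwich/burger advance to S1, the obstacle o
# leads to the trap state T, and everything else self-loops in S0.
T_S0 = np.zeros((len(states), len(props)))
T_S0[states.index("S1"), props.index("sandwich")] = 1.0
T_S0[states.index("S1"), props.index("burger")] = 1.0
T_S0[states.index("T"), props.index("o")] = 1.0
T_S0[states.index("S0"), props.index("banana")] = 1.0
T_S0[states.index("S0"), props.index("Ø")] = 1.0
```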
Slide 9 – Differentiable Recursive Planning
Learn the transitions of the FSA; learn the reward; learn the low-level transitions. One VIN for each FSA state.
Based on Tamar, Aviv, et al. "Value Iteration Networks." Advances in Neural Information Processing Systems, 2016.
We use differentiable recursive planning to approximate value iteration and calculate a policy for the agent. The matrix form of the FSA allows us to embed the FSA as a convolution in the planning step – for more details, come to our poster session.

Slide 10 – Experiments – Interpretability
[Figure: the learned transition matrix of state S0, with propositions as columns and the FSA states S0, S1, S2, S3, G, T as rows.]
This is the learned transition matrix of the first state of the FSA. Columns correspond to propositions, or important features of the environment. Rows correspond to the other FSA states.

Slide 11 – Experiments – Interpretability
[Figure: the same matrix, highlighting that picking up the sandwich or the hamburger causes a transition to the next state.]
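The planning recursion can be sketched in plain numpy, under heavy assumptions: this is not the authors’ implementation (which uses one value iteration network per FSA state, with learned convolutional transition and reward models), and the deterministic grid motion with wrapping borders below is a simplification. It only illustrates how the FSA transition matrices couple one value map per FSA state during value iteration.

```python
import numpy as np

def plan(reward, prop_map, fsa_T, gamma=0.95, iters=50):
    """reward:   (n_fsa, H, W) learned reward per FSA state and grid cell
       prop_map: (n_props, H, W) one-hot map of the proposition true at
                 each cell (Ø where nothing holds)
       fsa_T:    (n_fsa, n_fsa, n_props) stack of per-state matrices like
                 T_S0 above; fsa_T[f, g, p] = P(next FSA state g | f, p)"""
    n_fsa, H, W = reward.shape
    V = np.zeros((n_fsa, H, W))
    for _ in range(iters):
        # FSA step: at every cell, the proposition true there selects a
        # column of the current state's matrix, giving the expected value
        # over next FSA states (sum over g and p).
        V_next = np.einsum("fgp,phw,ghw->fhw", fsa_T, prop_map, V)
        # Low-level step: one-step lookahead over grid moves
        # (stay/up/down/left/right), with deterministic, wrapping motion.
        Q = np.stack([np.roll(V_next, s, axis=ax) if s else V_next
                      for s, ax in [(0, 1), (1, 1), (-1, 1), (1, 2), (-1, 2)]])
        V = reward + gamma * Q.max(axis=0)
    return V
```

In the actual architecture the low-level lookahead is a learned convolution, which is why expressing the FSA in matrix form lets the whole recursion stay differentiable and be trained end-to-end from demonstrations.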
Slide 12 – Experiments – Manipulability
We can modify the FSA so that it will only pick up the burger and not the sandwich.
[Figure: the lunchbox FSA with the sandwich edge out of the initial state highlighted.]
Since we have learned an interpretable model of the rules, we can easily modify the rules to change the behavior of the agent. In terms of the FSA, this means just deleting the edge between the initial state and the next state.

Slide 13 – Experiments – Manipulability
We can modify the FSA so that it will only pick up the burger and not the sandwich.
[Figure: the same FSA with the sandwich edge deleted.]

Slide 14 – Experiments – Manipulability
We can modify the FSA so that it will only pick up the burger and not the sandwich.
[Figure: the transition matrix of state S0 with the corresponding entry changed.]
This is also easy to express using the transition matrix of the FSA; we can change the values in the matrix to change the form of the FSA.
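Continuing the earlier numpy sketch (again an illustrative assumption, not the authors’ code), deleting the sandwich edge is just an edit to two entries of T_S0: the sandwich column stops pointing to S1 and becomes a self-loop.

```python
# Remove the Initial State -> Picked up edge for the sandwich, so only
# the burger advances the FSA out of S0.
T_S0[states.index("S1"), props.index("sandwich")] = 0.0
T_S0[states.index("S0"), props.index("sandwich")] = 1.0
```

Replanning with the edited matrices should then produce the new burger-only behavior directly, with no retraining.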
Slide 15 – Experiments – Manipulability
We can modify the FSA so that it will only pick up the burger and not the sandwich.
[Figure: the transition matrix of state S0 as its values are edited.]

Slide 16 – Experiments – Manipulability
We can modify the FSA so that it will only pick up the burger and not the sandwich.
[Figure: the edited transition matrix of state S0.]

Slide 17
Learning to Plan with Logical Automata
Brandon Araki 1*, Kiran Vodrahalli 2*, Thomas Leech 1,3, Mark Donahue 3, Cristian-Ioan Vasile 1, Daniela Rus 1
1 Massachusetts Institute of Technology  2 Columbia University  3 MIT Lincoln Laboratory
*Equal contributors