Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation
Alane Suhr and Yoav Artzi

Executing Context-Dependent Instructions
Task: map a sequence of instructions to actions
• Existing Work: Symbolic Representations; Modeling Context
• Today: System Actions; Learning from Exploration

Executing a Sequence of Instructions
[Figure: beakers numbered 1 to 7]
Empty out the leftmost beaker of purple chemical
Then, add the contents of the first beaker to the second
Mix it
Then, drain 1 unit from it
Same for 1 more unit

Problem Setup
• Task: follow sequence of instructions
• Learning from instructions and corresponding world states
Example instructions:
Empty out the leftmost beaker of purple chemical
Then, add the contents of the first beaker to the second
Mix it
Then, drain 1 unit from it
Same for 1 more unit

Related Work
• Context-dependent language understanding
  • Static environments (e.g., large database): Miller et al. 1996, Zettlemoyer and Collins 2009, Suhr et al. 2018
  • Environments that change over time while instructions are given: Long et al. 2016, Guu et al. 2017, Fried et al. 2018
• Following instructions in isolation; varying levels of supervision: Chen and Mooney 2011, Chen 2012, Artzi and Zettlemoyer 2013, Artzi et al. 2014, Andreas and Klein 2015, Bisk et al. 2016, Misra et al. 2017

Today
1. Attention-based model for generating sequences of system actions that modify the environment
2. Exploration-based learning procedure that avoids biases learned early in training

System Actions
[Figure: beakers numbered 1 to 7]
Mix it: pop 2; pop 2; pop 2; push 2 brown; push 2 brown; push 2 brown;
• Each beaker is a stack
• Actions are pop and push
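
The stack view of beakers on the System Actions slide can be illustrated with a minimal sketch. The state encoding (a list of Python lists of color names), the tuple form of actions, and the starting contents of the beakers are all assumptions made for illustration; only the `pop` and `push` actions and the "Mix it" action sequence come from the slide.

```python
from typing import List, Tuple

# Hypothetical representation: the world is a list of beakers, each a stack
# (Python list) of color names, with the bottom of the beaker first.
State = List[List[str]]


def apply_action(state: State, action: Tuple) -> State:
    """Return a new state after one system action.

    Actions are ("pop", beaker) or ("push", beaker, color); beakers are
    numbered from 1, as on the slide.
    """
    new_state = [list(beaker) for beaker in state]  # copy, keep input intact
    name, index = action[0], action[1] - 1
    if name == "pop":
        if not new_state[index]:
            raise ValueError("cannot pop from an empty beaker")
        new_state[index].pop()
    elif name == "push":
        new_state[index].append(action[2])
    else:
        raise ValueError(f"unknown action: {name}")
    return new_state


def execute(state: State, actions: List[Tuple]) -> State:
    """Apply a sequence of system actions in order."""
    for action in actions:
        state = apply_action(state, action)
    return state


# "Mix it" for beaker 2: remove three units, then add three brown units.
mix_beaker_2 = [("pop", 2)] * 3 + [("push", 2, "brown")] * 3
# Made-up starting contents, three beakers only to keep the example short.
start = [["purple"], ["green", "green", "red"], []]
print(execute(start, mix_beaker_2))
# → [['purple'], ['brown', 'brown', 'brown'], []]
```

In this sketch a single instruction expands into six low-level actions, so the abstraction "mix" is never named; it would have to be learned from the action sequences.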

Meaning Representation
[Figure: beakers numbered 1 to 7]
Mix it
• High-level program (representation engineering): mix(prevArg2(2))
• vs. system actions (learning abstractions): pop 2; pop 2; pop 2; push 2 brown; push 2 brown; push 2 brown;

Model
• Four inputs: current instruction, previous instructions, initial state, current state
• Output: a sequence of actions
• Attend over each input when generating actions
Example
• Previous instructions: Throw out first beaker; Pour sixth beaker into last one
• Current instruction: It turns brown
Steps
1. Encode instructions
2. Encode states
3. Initialize decoder
4. Attend over current instruction
5. Attend over previous instructions
6. Attend over initial state
7. Attend over current state
8. Predict action: an MLP over the decoder state and the attended current instruction, previous instructions, initial state, and current state outputs pop 7
9. Execute action, update state
10. Attend over new state
11. The action decoder continues until the full sequence is generated: pop 7; pop 7; pop 7; push 7 brown; push 7 brown; push 7 brown
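
The attention steps on the Model slides can be sketched in a few lines. This is a simplified illustration, not the model from the talk: it assumes dot-product attention, uses the decoder state as the only query for all four inputs, and takes the encoded inputs as plain lists of vectors. The names `attend` and `decoder_features` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Dot-product attention (an assumption): weighted average of `keys`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def decoder_features(decoder_state: Vector,
                     inputs: Dict[str, List[Vector]]) -> Vector:
    """Concatenate the decoder state with one context vector per input."""
    features = list(decoder_state)
    for name in ("current_instruction", "previous_instructions",
                 "initial_state", "current_state"):
        features.extend(attend(decoder_state, inputs[name]))
    return features


# Toy 2-dimensional encodings; the values are made up.
toy_inputs = {
    "current_instruction": [[1.0, 0.0], [0.0, 1.0]],
    "previous_instructions": [[0.5, 0.5]],
    "initial_state": [[1.0, 1.0], [0.0, 0.0]],
    "current_state": [[0.2, 0.8], [0.8, 0.2]],
}
features = decoder_features([1.0, 0.0], toy_inputs)
print(len(features))  # decoder state (2) plus four context vectors (2 each)
```

The resulting feature vector is what an MLP would score to predict the next action; after the action is executed, only the `current_state` entry changes before the next decoding step.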

Learning from World State Annotation
• Goal: learn a policy that maps from instructions and environment states to actions
• Approach
  • Learn through exploring the environment and observing rewards
  • Policy gradient with contextual bandit
• Challenge: overcome biases acquired early during learning
Example instructions:
Empty out the leftmost beaker of purple chemical
Then, add the contents of the first beaker to the second
Mix it
Then, drain 1 unit from it
Same for 1 more unit
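
As a rough illustration of "policy gradient with contextual bandit", the sketch below trains a linear softmax policy where each update uses only the immediate reward of the one sampled action. It is a generic textbook-style update under stated assumptions, not the learning procedure from the talk: the policy class, the toy contexts, the reward function, and the learning rate are all hypothetical.

```python
import math
import random
from typing import Callable, List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SoftmaxPolicy:
    """Linear softmax policy over a fixed action set (illustrative only)."""

    def __init__(self, num_features: int, num_actions: int) -> None:
        self.weights = [[0.0] * num_features for _ in range(num_actions)]

    def probabilities(self, context: List[float]) -> List[float]:
        scores = [sum(w * x for w, x in zip(row, context)) for row in self.weights]
        return softmax(scores)

    def update(self, context: List[float], action: int, reward: float,
               learning_rate: float = 0.5) -> None:
        # Gradient of log pi(action | context) for row a is
        # (1[a == action] - pi(a | context)) * context, scaled by the reward.
        probs = self.probabilities(context)
        for a, row in enumerate(self.weights):
            indicator = 1.0 if a == action else 0.0
            for i, x in enumerate(context):
                row[i] += learning_rate * reward * (indicator - probs[a]) * x


def train(policy: SoftmaxPolicy, contexts: List[List[float]],
          reward_fn: Callable[[List[float], int], float],
          steps: int, rng: random.Random) -> None:
    for _ in range(steps):
        context = rng.choice(contexts)
        probs = policy.probabilities(context)
        action = rng.choices(range(len(probs)), weights=probs)[0]  # explore
        policy.update(context, action, reward_fn(context, action))


# Toy problem (made up): two one-hot contexts, three actions.
contexts = [[1.0, 0.0], [0.0, 1.0]]


def toy_reward(context: List[float], action: int) -> float:
    correct = 0 if context[0] == 1.0 else 2
    return 1.0 if action == correct else 0.0


policy = SoftmaxPolicy(num_features=2, num_actions=3)
train(policy, contexts, toy_reward, steps=500, rng=random.Random(0))
print(policy.probabilities(contexts[0]))
```

Because the policy only learns from actions it happens to sample, whatever it prefers early on is sampled more often and reinforced further, which is one way to read the challenge named on the slide.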