Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions
Clara Cannon
ccannon@cs.utexas.edu
AGENDA
● INTRODUCTION
● MARCO ARCHITECTURE
○ Modeling Route Instructions
○ Representing Expected Views and Actions
○ Interleaving Action, Perception, and Modeling
○ Robustness to Errors and Ambiguities
○ Inferring Actions Implicit in Instructions
● EVALUATION
○ Implicit Action Inference Experiment
● CONCLUSION
● DISCUSSION
Introduction
Scenario: A host has given you instructions for finding an office in a building you have never visited. Most likely, the instructions will be incomplete, and you will have to infer certain steps to arrive at the correct destination. To resolve ambiguities, you draw on your knowledge of language, your understanding of spatial actions (e.g., "turn so that your back is facing the pink wall"), and a model of the environment.
What is MARCO?
● An agent that follows free-form natural language route instructions
● Represents and executes a sequence of compound action specifications
● Infers implicit actions
● Can perform implicit, explicit, and exploratory actions in its environment
● Manually built and hand-tuned
MARCO Architecture
6 modules:
Linguistic Grounding
1. Syntax Parser
2. Content Framer
3. Instruction Modeler
Spatial Grounding
4. Executor
5. Robot Controller
6. View Description Matcher
MARCO Architecture
● An example parse tree (figure) shows the transformation of text into an imperative model
● Syntax Parser: models the surface structure of a word/statement
● Content Framer: interprets the surface meaning of a word/statement
● Instruction Modeler: applies spatial and linguistic knowledge to combine information across phrases and sentences
● Executor: interleaves action and perception; acts to gain knowledge of the environment and executes instructions in the context of the spatial model
● Robot Controller: interface to the particular follower's motor and sensory functions
● View Description Matcher: checks symbolic view descriptions against sensory observations and world models (expected model against world model)
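The module list above describes a pipeline. The following minimal Python sketch shows only the order in which results flow from linguistic grounding to spatial grounding. Every function name and the three-verb lexicon are hypothetical toy stand-ins, not MARCO's implementation.

```python
from typing import Any, Callable

# Hypothetical stand-ins for MARCO's modules. Each stage consumes the previous
# stage's output, mirroring text -> parse -> content frame -> imperative model.
def syntax_parser(text: str) -> list[str]:
    # A real parser builds a tree; here we only lowercase and tokenize.
    return text.lower().rstrip(".").split()

def content_framer(tokens: list[str]) -> dict[str, Any]:
    # Toy surface meaning: first token is the verb, the rest are its arguments.
    return {"verb": tokens[0], "args": tokens[1:]}

def instruction_modeler(frame: dict[str, Any]) -> list[str]:
    # Toy verb lexicon (an assumption); unknown verbs fall back to VERIFY.
    verb_to_action = {"go": "TRAVEL", "walk": "TRAVEL", "turn": "TURN"}
    return [verb_to_action.get(frame["verb"], "VERIFY")]

LINGUISTIC_GROUNDING: list[Callable[[Any], Any]] = [
    syntax_parser, content_framer, instruction_modeler,
]

def ground_language(text: str) -> list[str]:
    result: Any = text
    for stage in LINGUISTIC_GROUNDING:
        result = stage(result)
    return result

# Spatial grounding: the executor drives the robot controller and consults
# the view description matcher.
def robot_controller(action: str, log: list[str]) -> None:
    log.append(action)  # a real controller would send motor commands

def view_description_matcher(expected: str, observed: set[str]) -> bool:
    return expected in observed

def executor(actions: list[str], goal_object: str, observed: set[str]) -> list[str]:
    log: list[str] = []
    for action in actions:
        robot_controller(action, log)
    if view_description_matcher(goal_object, observed):
        robot_controller("DECLARE-GOAL", log)
    return log

print(executor(ground_language("Go down the hall."), "chair", {"hall", "chair"}))
# ['TRAVEL', 'DECLARE-GOAL']
```

Keeping each stage a plain function makes the hand-offs explicit; in MARCO each stage is far richer, and the executor interleaves perception with every action rather than checking once at the end.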
Modeling Route Instructions
● The syntax parser parses raw route instruction text
● The content framer translates the surface structure of a word/statement into a model of its surface meaning, represented as a nested attribute-value matrix
● The content frame models the nested structure and sense of a word/statement while discarding punctuation, arbitrary text ordering, inflectional suffixes, and spelling variations
● The syntax parser uses a probabilistic context-free grammar that directly models verb-argument structure
● The content framer gets word senses from WordNet
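To illustrate how a content frame abstracts away surface variation, here is a toy sketch, assuming the parser has already split a clause into a verb and two prepositional phrases. The suffix list, spelling table, and role names are illustrative assumptions; this is not a real stemmer (it would mangle a word like "glass"), and it does not consult WordNet.

```python
import re

# Toy tables standing in for real morphological and spelling knowledge (assumptions).
INFLECTIONAL_SUFFIXES = ("ing", "ed", "s")
SPELLING_VARIANTS = {"hallway": "hall", "grey": "gray"}
DETERMINERS = {"the", "a", "an"}

def normalize(word: str) -> str:
    """Lowercase, strip one inflectional suffix, then map spelling variants."""
    w = word.lower()
    for suffix in INFLECTIONAL_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            w = w[: -len(suffix)]
            break
    return SPELLING_VARIANTS.get(w, w)

def head_word(phrase: str) -> str:
    """Drop punctuation and determiners; treat the last word as the phrase head."""
    words = [w for w in re.sub(r"[^\w\s]", "", phrase).lower().split()
             if w not in DETERMINERS]
    return normalize(words[-1])

def content_frame(verb: str, along: str, until: str) -> dict:
    # Roles are keyed by name, so the original word order no longer matters.
    return {
        "verb": normalize(verb),
        "along": {"path": head_word(along)},
        "until": {"object": head_word(until)},
    }

a = content_frame("go", "down the hall", "to the chair")
b = content_frame("Going", "down the hallways,", "to the chairs!")
print(a == b)  # True
```

Two surface forms that differ in punctuation, inflection, and spelling collapse to the same nested attribute-value structure.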
Modeling Route Instructions
● The instruction modeler translates the content frame's representation of the surface meaning of an instruction element into an imperative model containing compound action specifications
● It infers this model by applying knowledge of verbs and prepositions in route instructions, and knowledge of how perception and action depend on the local spatial configuration in similar environments
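The verb-and-preposition part of the instruction modeler can be sketched as a lexicon lookup. This minimal Python sketch assumes a content frame given as a verb plus (preposition, object) pairs; the lexicon entries and slot names are hypothetical, and it omits the spatial-configuration knowledge the slide also mentions.

```python
# Toy lexicons standing in for MARCO's knowledge of route-instruction verbs and
# prepositions; the entries and slot names are illustrative assumptions.
VERB_ACTIONS = {"go": "TRAVEL", "walk": "TRAVEL", "turn": "TURN", "face": "TURN"}
PREPOSITION_SLOTS = {"down": "along", "along": "along", "to": "until",
                     "until": "until", "toward": "facing"}

def model_instruction(frame: dict) -> dict:
    """Map a content frame to an imperative model: one action plus its parameters."""
    model = {"action": VERB_ACTIONS[frame["verb"]]}
    for preposition, obj in frame["phrases"]:
        slot = PREPOSITION_SLOTS.get(preposition)
        if slot is not None:  # prepositions the lexicon does not know are skipped
            model[slot] = obj
    return model

frame = {"verb": "go", "phrases": [("down", "hall"), ("to", "chair"), ("past", "lamp")]}
print(model_instruction(frame))
# {'action': 'TRAVEL', 'along': 'hall', 'until': 'chair'}
```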
Representing Expected Views and Actions
● A view description represents what the follower expects to observe at a given pose/orientation in the environment, based on descriptions in the instructions
● Instructions tend to describe some distinctive attribute of selected scenes along the route
● For each expected object, the view description models the object's type, its location within the observer's view, and a description of its appearance and attributes
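The three fields per expected object map naturally onto a small record type. A minimal Python sketch, assuming string labels for object types and view locations (the class names and location vocabulary are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExpectedObject:
    kind: str                               # object type, e.g. "wall" or "chair"
    location: str                           # place in the observer's view: "front", "left", "right", "back", "at"
    attributes: frozenset[str] = frozenset()  # appearance, e.g. frozenset({"pink"})

@dataclass
class ViewDescription:
    """What the follower expects to perceive at one pose along the route."""
    objects: list[ExpectedObject] = field(default_factory=list)

    def expects(self, kind: str, location: str) -> bool:
        return any(o.kind == kind and o.location == location for o in self.objects)

# "Turn so that your back is facing the pink wall" as an expected view:
view = ViewDescription([ExpectedObject("wall", "back", frozenset({"pink"}))])
print(view.expects("wall", "back"))  # True
```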
Representing Expected Views and Actions
● Route instructions require at least four low-level simple actions:
○ TURN: changes an agent's pose/orientation but preserves its location
○ TRAVEL: changes an agent's location but preserves its pose/orientation
○ VERIFY: checks an observation against a description of an expected view
○ DECLARE-GOAL: terminates instruction following by asserting that the agent is at the desired destination
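The four simple actions can be made concrete on a grid world. This sketch assumes a pose of integer (x, y) plus one of four compass headings; the grid and all names are illustrative assumptions, not MARCO's spatial model.

```python
from dataclasses import dataclass, replace

HEADINGS = ["N", "E", "S", "W"]            # clockwise order
STEP = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

@dataclass(frozen=True)
class Pose:
    x: int
    y: int
    heading: str

def turn(pose: Pose, quarter_turns: int) -> Pose:
    """TURN: change orientation, keep location (positive = clockwise)."""
    i = (HEADINGS.index(pose.heading) + quarter_turns) % 4
    return replace(pose, heading=HEADINGS[i])

def travel(pose: Pose, steps: int) -> Pose:
    """TRAVEL: change location, keep orientation."""
    dx, dy = STEP[pose.heading]
    return replace(pose, x=pose.x + dx * steps, y=pose.y + dy * steps)

def verify(observed: set, expected: set) -> bool:
    """VERIFY: every expected feature must appear in the observation."""
    return expected <= observed

def declare_goal(pose: Pose) -> tuple:
    """DECLARE-GOAL: stop following and assert that this location is the destination."""
    return (pose.x, pose.y)

p = turn(Pose(0, 0, "N"), 1)   # face east, still at (0, 0)
p = travel(p, 2)               # move two cells east, still facing east
print(p, declare_goal(p))      # Pose(x=2, y=0, heading='E') (2, 0)
```

Making `Pose` immutable ensures each action returns a new pose, so the invariant each action preserves (location for TURN, orientation for TRAVEL) is easy to check.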
Representing Expected Views and Actions
● Compound action specifications capture the commands in route instructions by modeling which simple actions to take under which perceptual or cognitive conditions
● Each clause is interpreted as a compound action specification
● Adverbs, verb objects, and prepositional phrases translate to pre-conditions, while-conditions, and post-conditions
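One way to read a compound action specification is as a simple action repeated under three predicates. The sketch below assumes conditions are Python functions over a dictionary world state and adds a step bound as a safety assumption; the field names (`while_`, since `while` is reserved) and the one-dimensional hall are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Condition = Callable[[dict], bool]

@dataclass
class CompoundAction:
    """One clause: a simple action gated by pre-, while-, and post-conditions."""
    step: Callable[[dict], None]          # performs one simple action on the state
    post: Condition                       # repeat until this holds
    pre: Condition = lambda s: True       # must hold before starting
    while_: Condition = lambda s: True    # must keep holding while repeating
    max_steps: int = 50                   # safety bound (an assumption)

def run(spec: CompoundAction, state: dict) -> bool:
    if not spec.pre(state):
        return False
    steps = 0
    while not spec.post(state) and spec.while_(state) and steps < spec.max_steps:
        spec.step(state)
        steps += 1
    return spec.post(state)

def forward(s: dict) -> None:
    s["pos"] += 1

# "Go down the hall until the chair"
go_to_chair = CompoundAction(
    step=forward,
    post=lambda s: s["pos"] == s["chair"],        # "until the chair"
    pre=lambda s: s["facing"] == "hall",           # "down the hall": path must be ahead
    while_=lambda s: s["pos"] < s["hall_length"],  # stay within the hall
)
state = {"pos": 0, "chair": 4, "hall_length": 6, "facing": "hall"}
print(run(go_to_chair, state), state["pos"])  # True 4
```

The prepositional phrases end up as predicates rather than as fixed distances, which is what lets the same clause work from different starting points.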
Interleaving Action, Perception, and Modeling
● The executor sequences simple actions given the environment context and the state of instruction following
● It executes each compound action specification before moving on to the next
● The robot controller executes simple actions
● The view description matcher checks symbolic view descriptions against sensory observations, treating a view description as constraints the observation stream must meet
● Handling of ambiguity is deferred until the environment provides enough context to disambiguate
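Deferring ambiguity while treating descriptions as constraints can be sketched as candidate filtering over an observation stream. This is an illustrative technique under stated assumptions (each interpretation is a dictionary of expected feature values, and a candidate is dropped only when an observation gives a conflicting value), not MARCO's actual matcher.

```python
from typing import Iterable, Optional

View = dict  # feature name -> value, e.g. {"floor": "blue"}

def contradicts(description: View, observation: View) -> bool:
    """A description is a set of constraints; it fails only on a conflicting value."""
    return any(key in observation and observation[key] != value
               for key, value in description.items())

def disambiguate(candidates: dict, stream: Iterable[View]) -> tuple:
    """Keep all interpretations alive until observations leave at most one."""
    alive = dict(candidates)
    seen = 0
    for observation in stream:
        seen += 1
        alive = {name: d for name, d in alive.items() if not contradicts(d, observation)}
        if len(alive) <= 1:
            break
    winner: Optional[str] = next(iter(alive)) if len(alive) == 1 else None
    return winner, seen

# Hypothetical: two halls both match "the blue hall" until the far end comes into view.
candidates = {"hall A": {"floor": "blue", "end": "easel"},
              "hall B": {"floor": "blue", "end": "lamp"}}
stream = [{"floor": "blue"}, {"floor": "blue", "end": "lamp"}, {"floor": "blue"}]
print(disambiguate(candidates, stream))  # ('hall B', 2)
```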
Robustness to Errors and Ambiguities
● When MARCO does not know a word, it uses WordNet to search for the nearest known synonym or a more abstract hypernym
● If the content framer encounters a constituent it cannot model, it ignores that constituent and models the rest of the clause
● The parser uses a similar strategy: if it cannot parse one sentence in a set of instructions, it still parses the others
● Why these techniques work:
1. Route instructions contain redundant information
2. Essential information in route instructions is stated using a small variety of content frames for describing direction and movement
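The synonym-then-hypernym fallback and the skip-what-you-cannot-model strategy fit in a few lines. This sketch assumes hand-written dictionaries in place of WordNet (it does not call the WordNet or NLTK API), and it skips individual words rather than whole constituents as a simplification.

```python
from typing import Optional

# Toy lexicon and relations standing in for MARCO's vocabulary and WordNet
# (assumption: hand-written dictionaries, not the WordNet API).
KNOWN_WORDS = {"hall", "walk", "furniture", "chair"}
SYNONYMS = {"corridor": ["hallway", "hall"], "stroll": ["amble", "walk"]}
HYPERNYMS = {"stool": "seat", "seat": "furniture"}

def ground_word(word: str) -> Optional[str]:
    """Return the word, a known synonym, or the nearest known hypernym; else None."""
    if word in KNOWN_WORDS:
        return word
    for synonym in SYNONYMS.get(word, []):
        if synonym in KNOWN_WORDS:
            return synonym
    parent, visited = HYPERNYMS.get(word), {word}
    while parent is not None and parent not in visited:  # climb toward more abstract terms
        if parent in KNOWN_WORDS:
            return parent
        visited.add(parent)
        parent = HYPERNYMS.get(parent)
    return None

def frame_clause(words: list) -> list:
    """Model what can be grounded and skip the rest (word-level simplification)."""
    grounded = (ground_word(w) for w in words)
    return [g for g in grounded if g is not None]

print(frame_clause(["stroll", "corridor", "xyzzy", "stool"]))
# ['walk', 'hall', 'furniture']
```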
Inferring Actions Implicit in Instructions
● Implicit actions are inferred using linguistic and spatial knowledge
● Example: "Go down the hall to the chair"
● The language model interprets the phrase structure as the along and until parameters of a TRAVEL action
● MARCO infers the conditions of the TRAVEL action as:
○ Pre: the path should be immediately in front, and the chair should be in front, in the distance
○ Post: the chair will be local to the agent
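The inferred conditions for the example can be written down as (object, place-in-view) pairs and checked against the current view. A minimal sketch, assuming the place labels below (hypothetical) and a view represented as a set of such pairs:

```python
def infer_travel_conditions(along: str, until: str) -> dict:
    """Implicit conditions for TRAVEL(along=path, until=landmark), per the example above.

    Each condition is an (object, place-in-view) pair; the place labels are assumptions.
    """
    return {
        "pre": [(along, "immediately-front"), (until, "front-distant")],
        "post": [(until, "local")],
    }

def unmet(conditions: list, view: set) -> list:
    """Return the conditions the current view does not satisfy."""
    return [c for c in conditions if c not in view]

conds = infer_travel_conditions("hall", "chair")  # "Go down the hall to the chair"
view = {("hall", "immediately-front")}            # the chair is not yet visible ahead
print(unmet(conds["pre"], view))                  # [('chair', 'front-distant')]
```

An unmet pre-condition is exactly the signal that an implicit action (or exploration) is needed before the explicit TRAVEL can start.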
Inferring Actions Implicit in Instructions
● If the pre- and post-conditions are not met, the executor may take exploratory actions to gain information or to determine the location of a reference object
● The figure shows how an instruction is applied to navigate with different maps and starting poses
● In some scenarios, the agent must perform an exploratory action to orient itself in the environment according to the instruction
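An exploratory TURN can be sketched as rotating in place until the path named in the instruction is in view. This assumes a hypothetical intersection where the agent knows what it would see at each of four headings; it is one simple exploration policy, not necessarily MARCO's.

```python
HEADINGS = ["N", "E", "S", "W"]  # clockwise

def orient_for_travel(views: dict, heading: str, path: str,
                      max_turns: int = 4) -> tuple:
    """Exploratory TURNs: rotate until `path` is in view; report the TURNs taken."""
    actions = []
    i = HEADINGS.index(heading)
    for _ in range(max_turns):
        if path in views.get(HEADINGS[i], set()):
            return HEADINGS[i], actions
        i = (i + 1) % 4
        actions.append("TURN")
    return None, actions  # path never seen: the instruction cannot be grounded here

# Hypothetical intersection: what the agent sees when facing each direction.
views = {"N": {"wall"}, "E": {"hall", "chair"}, "S": {"wall"}, "W": {"easel"}}
heading, actions = orient_for_travel(views, "N", "hall")
print(heading, actions + ["TRAVEL"])  # E ['TURN', 'TRAVEL']
```

The same instruction yields different action sequences from different starting headings, which is the point of the figure's varying maps and poses.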
Evaluation
● Evaluated MARCO in 3 environments with a corpus of route instructions written by 6 human directors
● The corpus consists of 786 route instruction texts (682 with omissions)
● 36 human subjects followed the route instructions to establish a baseline
● Used a desktop virtual reality environment
● Route instruction texts were rated 1-6 (1: vague, hard to follow; 6: detailed, easy to follow)
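Taking the slide's counts at face value, the share of instruction texts with omissions works out to roughly 87%, which underlines why implicit action inference matters:

```latex
\frac{682}{786} \approx 0.868
```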
Evaluation
Benefits of using a VR environment:
1. All route directors had similar exposure to the environment
2. Pertinent aspects of the environment were known and repeatable
3. Directors learned the environment from the same first-person perspective as followers
4. MARCO can navigate the same environment as people
Evaluation
● For testing, the agent was given route instructions as text, starting from a given location
● Success was determined by whether the agent reached the desired destination
● Speed of arrival was not taken into account
● No explanation of what happens when an agent gets lost
● No account of what happens when an agent makes an incorrect inference (e.g., incorrectly identifying the color blue)
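The success criterion described above is binary per trial. A minimal sketch, assuming each trial records a final position, a goal, and a step count; the step count is kept only to show that it does not affect the score. The trials are hypothetical, not data from the study.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    final_position: tuple
    goal: tuple
    steps_taken: int  # recorded, but ignored by the metric (speed was not scored)

def success_rate(trials: list) -> float:
    """Fraction of trials ending at the goal; path length and time do not matter."""
    if not trials:
        raise ValueError("success_rate needs at least one trial")
    return sum(t.final_position == t.goal for t in trials) / len(trials)

# Hypothetical trials, not data from the study.
trials = [Trial((2, 0), (2, 0), 5), Trial((2, 0), (2, 0), 40), Trial((1, 3), (2, 0), 12)]
print(success_rate(trials))
```

The 5-step and 40-step trials count equally, which is the limitation the slide points out.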
Implicit Action Inference Experiment
● Results for 5 types of follower: (1) human participants, (2) the full MARCO model, (3) MARCO without TURN inference, (4) MARCO without TRAVEL inference, (5) MARCO without TURN or TRAVEL inference
● Humans found the destination with an overall mean success rate of 69%
● MARCO successfully followed 61% of route instructions
● TURN inference is more valuable than TRAVEL inference
Conclusion
● The authors claim this experiment is easier and less expensive to replicate than similar work
● The paper uses knowledge of language and space to infer implicit actions when following natural language route instructions
● It contributes an assessment of human performance in communicating route information through unfamiliar large-scale spaces
● Future work includes: replacing the executor algorithm with a full action sequencer or an algorithm that reasons about the inferred route topology; generalizing the methods to other domains (e.g., cooking, first aid); and applying similar evaluation techniques to other large-scale instructional tasks
Discussion!
Recommend
More recommend