Embodied Machines
• The Grounding (binding) Problem – real cognizers form multiple associations between concepts
• Affordances – how an object is interacted with
• Frames – background structure against which a concept is understood; sometimes highly complex (the educational system, family relationships)
• Emotions – witnessing an event or seeing an object conjures up emotional states
• Mental simulation – comprehending language may trigger imagistic modeling of the event, based on experience
Embodied Machines
– Mouse
• Mammal, small, furry, grey to brown, long whiskers, cats like to play with them and then eat them, they’re used in experiments, ladies stand on chairs when they’re around, they squeak, they’re prolific breeders, they’re sold live as snake food, they’re one kind of rodent, they look a lot like rats, they are sometimes pets, they like to run on a wheel…
– Play
• The opposite of work, it’s fun, kids do it, scheduled in during grade school, you play games, you play with words, …
Embodied Machines
– Approaches to meaning construction
• NLP
  – Text/speech is considered comprehended when it has been parsed syntactically and word meanings have been assigned
  – Meaning is pre-determined by humans in some way
• Embodied approach
  – The world has no structure until the body begins to interact in it » need goals and a sensorimotor system
  – Experience --> meaning
  – Words map onto meaning
Embodied Machines
– Steels’ Talking Heads
• Simple robots
  – Auditory and visual systems
  – Motivating goal = a language game
• Simple environment
  – A two-dimensional world containing objects
• Robots determine their own categories for objects
• Robots determine their own labels for those categories
• Robots and the environment are real physical entities
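The convergence dynamic behind the Talking Heads can be illustrated with a minimal naming-game simulation: agents invent labels for objects, align their inventories when a game succeeds, and adopt the speaker’s label when it fails. The sketch below is a simplification under stated assumptions (symbolic objects instead of camera input, no category formation, no physical robots); the Agent class, the label format, and the alignment rule are illustrative, not Steels’ actual implementation.

```python
import random

class Agent:
    """A simulated agent holding its own object -> label associations."""
    def __init__(self):
        self.lexicon = {}  # object -> set of candidate labels

    def label_for(self, obj):
        """Speak: reuse a known label for the object, or invent a new one."""
        if not self.lexicon.get(obj):
            self.lexicon[obj] = {"w%04d" % random.randrange(10_000)}
        return random.choice(sorted(self.lexicon[obj]))

    def recognizes(self, obj, label):
        """Listen: is this label already associated with the object?"""
        return label in self.lexicon.get(obj, set())

def naming_games(agents, objects, rounds=5000):
    for _ in range(rounds):
        speaker, hearer = random.sample(agents, 2)
        obj = random.choice(objects)
        label = speaker.label_for(obj)
        if hearer.recognizes(obj, label):
            # Success: both agents discard competing labels (alignment).
            speaker.lexicon[obj] = {label}
            hearer.lexicon[obj] = {label}
        else:
            # Failure: the hearer adopts the speaker's label as a candidate.
            hearer.lexicon.setdefault(obj, set()).add(label)

agents = [Agent() for _ in range(10)]
objects = ["red_square", "blue_circle", "green_triangle"]
naming_games(agents, objects)
# After enough games the population usually shares one label per object.
print({o: sorted({w for a in agents for w in a.lexicon.get(o, set())}) for o in objects})
```

With these settings the ten agents typically converge on a single shared label per object, which is the alignment the physical Talking Heads achieved while also building their own perceptual categories.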
Embodied Machines
– Cangelosi & Parisi
• Virtual agents, virtual world
• A kind of embodied learning
  – Agents have a physical location, orientation, and movement capabilities within their environment
  – Agents consume mushrooms, which affects their energy status
  – Agents (collectively) have a motivating task --> increase the fitness of the species
  – They sense perceptual characteristics, not mushrooms --> they learn which characteristics distinguish edible from poisonous mushrooms
  – Agents (collectively) learn to categorize and label mushrooms
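A toy version of the mushroom world, as a sketch: each agent is reduced to a weight vector, a mushroom is a feature vector with a hidden edibility rule, and fitness is the energy gained over a foraging lifetime. The feature count, the linear decision rule, the sum-of-features edibility rule, and all evolutionary parameters are assumptions for illustration; the original work evolved neural-network agents and also addressed labeling/communication, which is omitted here.

```python
import random

FEATURES = 5  # number of perceptual features per mushroom (illustrative)

def random_mushroom():
    """A mushroom is a feature vector; edibility depends on a hidden rule over features."""
    feats = [random.uniform(-1, 1) for _ in range(FEATURES)]
    edible = sum(feats) > 0          # the rule the agents must track indirectly
    return feats, edible

def decide(weights, feats):
    """Simple linear 'nervous system': approach and eat if the weighted sum is positive."""
    return sum(w * f for w, f in zip(weights, feats)) > 0

def fitness(weights, trials=200):
    """Energy gained: +1 for eating an edible mushroom, -1 for a poisonous one."""
    energy = 0
    for _ in range(trials):
        feats, edible = random_mushroom()
        if decide(weights, feats):
            energy += 1 if edible else -1
    return energy

def evolve(pop_size=50, generations=30):
    population = [[random.uniform(-1, 1) for _ in range(FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 5]            # top 20% reproduce
        population = [[w + random.gauss(0, 0.1) for w in random.choice(parents)]
                      for _ in range(pop_size)]
    return max(population, key=fitness)

best = evolve()
print("best agent energy over 200 mushrooms:", fitness(best))
```

The point of the exercise is the one on the slide: agents never sense “edibility” directly, only perceptual characteristics, and selection pressure makes the population track the characteristics that matter.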
Embodied Machines
– CELL (Deb Roy)
• Cross-channel Early Lexical Learning
• Models embodied language learning using input that approximates the input to human infants
• Instantiated in a robot body with a microphone and camera
• CELL learns to form word-meaning correspondences from raw (unsegmented) audio and visual input
Embodied Machines
– First task
• Segmentation
  – Parse the audio stream into segments
  – Parse the video stream into objects
  – The segmentation process produces a channel of ‘words’ and a channel of shapes
– Second task
• Build a lexicon by identifying frequently co-occurring pairs of audio and visual segments
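As a deliberately crude stand-in for the first task, the sketch below cuts a one-dimensional “audio” stream into candidate segments wherever the signal drops below an energy threshold. The actual system works on far richer representations than raw amplitude (the slide’s channel of ‘words’ and channel of shapes); the threshold value and the toy signal are invented.

```python
def segment_by_energy(signal, threshold=0.1):
    """
    Crude stand-in for audio segmentation: cut a 1-D amplitude stream into
    chunks wherever it falls below `threshold` (treated as silence).
    Illustration only; not how CELL segments speech.
    """
    segments, current = [], []
    for sample in signal:
        if abs(sample) >= threshold:
            current.append(sample)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

print(segment_by_energy([0.0, 0.5, 0.6, 0.0, 0.0, 0.7, 0.2, 0.0]))
# -> [[0.5, 0.6], [0.7, 0.2]]
```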
Embodied Machines
• Illustrative example (not from actual data)
• Imagine an utterance, “…don’t throw the ball at the cat…”, uttered in a scene containing several identified objects (noise present)
  [Images of the identified objects appear on the slide]
Embodied Machines
• Objects are not necessarily identified in the same order in which they are named in the utterance
• Time delays between the utterance and object recognition are highly likely
  [Slide aligns the utterance “…throw the ball at the cat” with the recognized objects]
Embodied Machines
– Short-term memory (STM): look at a temporal window surrounding each word
– The aim is to go back or forward far enough in time to have the word and its referent in the same window
  [Slide shows an STM window over “…throw the ball at the cat”]
Embodied Machines
– The window marches through the data stream, collecting segmented objects and words for possible mapping (a minimal sketch follows below)
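A minimal sketch of the STM window, assuming both channels have already been segmented into time-stamped symbolic events (which hides the hard segmentation work): each word is paired with every object whose timestamp falls inside a fixed window around it, and the resulting pair counts feed the mutual-information step on the next slide. The window length, the event format, and the function name are illustrative.

```python
from collections import Counter

def stm_pairs(word_events, object_events, window=3.0):
    """
    Pair each heard word with every object seen within +/- `window` seconds,
    mimicking a short-term-memory buffer sliding over the two channels.
    word_events / object_events: lists of (timestamp, label) tuples.
    """
    pairs = Counter()
    for t_w, word in word_events:
        for t_o, obj in object_events:
            if abs(t_w - t_o) <= window:
                pairs[(word, obj)] += 1
    return pairs

# Toy streams (timestamps in seconds; not real CELL data).
words = [(1.0, "throw"), (1.4, "the"), (1.7, "ball"), (2.1, "at"), (2.3, "the"), (2.6, "cat")]
objects = [(0.5, "BALL"), (3.2, "CAT")]
print(stm_pairs(words, objects))
```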
Embodied Machines
• Audio and visual segments that have a high degree of mutual information are likely semantically linked and should be saved in long-term memory (LTM)
  [Table of unique occurrence counts for words and visual categories and of their co-occurrences; the values used on the next slide: the word ‘cat’ occurs 100 times, ‘the’ 90,000 times, the object category 59 times, and each word co-occurs 40 times with that object]
Embodied Machines
• Mutual information
  MI(a, b) = P(a & b) / (P(a) · P(b)) ≅ co-occurrence(a & b) / (occurrence(a) × occurrence(b))
• Example, pairing two words with the same visual category:
  MI(‘cat’, object) ≈ 40 / (100 × 59) ≈ 0.0067
  MI(‘the’, object) ≈ 40 / (90,000 × 59) ≈ 0.0000075
• Words like ‘the’ are promiscuous: they co-occur with so many categories that they lack predictive power.
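A sketch of the slide’s score and of a filter for deciding which pairs to promote to LTM; the score is the slide’s co-occurrence ratio (a pointwise-association measure rather than textbook mutual information), and the threshold value is a hypothetical placeholder.

```python
def mi_score(co_occurrence, word_count, object_count):
    """The slide's approximation: co-occurrence normalized by the two raw counts."""
    return co_occurrence / (word_count * object_count)

# Counts from the slide's example: 'cat' and 'the' each co-occur 40 times with the
# same visual category (59 occurrences), but 'the' occurs 90,000 times overall.
print(mi_score(40, 100, 59))     # 'cat' -> 0.00678 (the slide rounds to 0.0067)
print(mi_score(40, 90_000, 59))  # 'the' -> 0.0000075

def ltm_candidates(pair_counts, word_counts, object_counts, threshold=1e-3):
    """Keep only (word, object) pairs whose score clears the (hypothetical) threshold."""
    return {(w, o): mi_score(c, word_counts[w], object_counts[o])
            for (w, o), c in pair_counts.items()
            if mi_score(c, word_counts[w], object_counts[o]) >= threshold}
```

With the slide’s counts, a filter like this keeps the ‘cat’-object pair and drops the ‘the’-object pair, which is exactly the promiscuous-word point being made.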
Embodied Machines
• Two implementations of CELL
  – Robot
  – Learning from observing infant/caregiver interaction
Embodied Machines
• Robot
  – Input: spoken utterances and images of objects acquired from a video camera mounted on the robot
  – An experimenter places objects in front of the robot and describes them
  – Acquisition of a lexicon
    • The robot gathers visual information about the environment while listening to speech (discovers high-MI pairs)
  – Speech generation
    • Search for objects in the environment, then describe them
  – Speech understanding (maps a word to an object)
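Once high-MI pairs are in the lexicon, speech understanding and speech generation reduce, in caricature, to lookups in opposite directions. The Lexicon class, its method names, and the uppercase visual-category labels are illustrative assumptions, not Roy’s API; acquisition would correspond to feeding in the pairs that survive the MI filter.

```python
class Lexicon:
    """Bidirectional store of acquired (word, visual category) pairs (illustrative)."""
    def __init__(self, pairs):
        self.word_to_obj = dict(pairs)
        self.obj_to_word = {obj: word for word, obj in pairs}

    def understand(self, word):
        """Speech understanding: map a heard word to a visual category, if known."""
        return self.word_to_obj.get(word)

    def describe(self, visible_objects):
        """Speech generation: name the currently visible objects the lexicon knows."""
        return [self.obj_to_word[obj] for obj in visible_objects if obj in self.obj_to_word]

lex = Lexicon([("ball", "BALL"), ("cat", "CAT")])
print(lex.understand("ball"))         # -> BALL
print(lex.describe(["CAT", "SHOE"]))  # -> ['cat']  (unknown objects are skipped)
```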
Embodied Machines
• Learning from infant-caregiver interaction
  – Infants played with 7 classes of objects
    • Balls, shoes, keys, toy cars, trucks, dogs, horses
  – Caregiver/infant interaction was natural
  – CELL attempted to build up a lexicon from observing these interactions, evaluated on:
    • Segmentation accuracy (do segment boundaries correspond to word boundaries?)
    • Word discovery (does a segment correspond to a single word?)
    • Semantic accuracy (if a word is segmented properly, is it mapped to the right object?)
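One way to operationalize the three evaluation questions, assuming each hypothesized lexical candidate carries three hand-annotated boolean judgments; the field names and the pooling (semantic accuracy computed only over single-word segments) are my assumptions about the bookkeeping, not the published protocol.

```python
def evaluate(candidates):
    """
    candidates: list of dicts with boolean keys
      'boundaries_ok'  - segment boundaries fall on true word boundaries
      'single_word'    - the segment spans exactly one word
      'object_correct' - the linked visual category is semantically relevant
    Returns (segmentation accuracy, word discovery, semantic accuracy).
    """
    n = len(candidates)
    seg = sum(c['boundaries_ok'] for c in candidates) / n
    word = sum(c['single_word'] for c in candidates) / n
    single = [c for c in candidates if c['single_word']]
    sem = sum(c['object_correct'] for c in single) / len(single) if single else 0.0
    return seg, word, sem

toy = [
    {'boundaries_ok': True,  'single_word': True,  'object_correct': True},
    {'boundaries_ok': False, 'single_word': True,  'object_correct': False},
    {'boundaries_ok': True,  'single_word': False, 'object_correct': False},
]
print(evaluate(toy))  # -> (0.666..., 0.666..., 0.5)
```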
Embodied Machines
• Segmentation accuracy
  – 28% (compared to 7% for an acoustic-only model)
• Word discovery
  – 72% of segmented items were single words (compared to 31% for an acoustic-only model)
• Semantic accuracy
  – 57% of hypothesized lexical candidates were both valid words and linked to semantically relevant visual categories