Visually Grounded Meaning Representation Qi Huang Ryan Rock
Outline 1. Motivation 2. Visually Grounded Autoencoders 3. Constructing Visual Attributes 4. Constructing Textual Attributes 5. Experiment: Similarity 6. Experiment: Categorization 7. Takeaway & Critiques
Motivation ● Word embeddings are “disembodied”: they represent word meaning only as the statistical pattern of its surrounding context, without grounding in real-world referents, which are usually accessible in modalities other than text ● Problem: the distribution of word representations overfits (and is fundamentally limited by) the training corpus’ statistical patterns; the learned representations cannot generalize well
Motivation Question: Can we use information from other modalities as input when building word representations?
Motivation ● Solution: Cognitive Science research shows that semantic attributes can represent multisensory information about a word that is not present in the surrounding text. ○ Example: apples are “green”, “red”, “round”, “shiny”. ● Model: a stacked-autoencoder-based model that learns high-level meaning representations by mapping words and images (both represented as attributes) into a common hidden space ● Verification: experiments on word similarity and concept categorization
Autoencoders An autoencoder is an unsupervised feed-forward neural network trained to reconstruct a given input from its hidden representation. Encoder: map input vector x to a hidden representation h. Decoder: reconstruct x (producing output y) from the hidden representation h. Minimize the reconstruction loss between x and y.
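The slide's formulas did not survive extraction; the following is the standard autoencoder formulation consistent with the description above (the symbols W, b, W′, b′, the nonlinearity g, and the squared-error loss are the usual conventions and are assumed here, not copied from the slide):

```latex
% Encoder: nonlinear map from input x to hidden code h
h = g(W x + b)
% Decoder: reconstruction y of the input from h
y = g(W' h + b')
% Reconstruction loss (squared error; cross-entropy is also common)
L(x, y) = \lVert x - y \rVert^{2}
```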
Denoising in autoencoders Denoising: reconstruct the clean input given a corrupted input. For example, randomly mask out some elements of the input. Effect: the model learns to activate knowledge about a concept when exposed to partial information
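A minimal sketch of the masking corruption described above (NumPy; the mask rate of 0.2 and the toy dimensions are illustrative assumptions, not values from the paper):

```python
import numpy as np

def mask_corrupt(x, mask_rate=0.2, rng=None):
    """Randomly zero out a fraction of the input elements (denoising corruption)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= mask_rate   # keep roughly 80% of the elements
    return x * mask

# Training pair for a denoising autoencoder: corrupted input, clean target
x_clean = np.random.rand(10)
x_noisy = mask_corrupt(x_clean)
# the autoencoder is trained to map x_noisy -> x_clean
```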
Stacked autoencoders ● A stacked autoencoder is essentially a collection of autoencoders “stacked” on top of each other ● To initialize the weights of each layer, train the autoencoders one at a time, feeding one autoencoder’s hidden encoding as the next autoencoder’s input ● Fine-tune the model end-to-end afterwards with an unsupervised training criterion (global reconstruction) or a supervised criterion
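A minimal sketch of greedy layer-wise pretraining as described above (PyTorch; layer sizes, epoch count, and the sigmoid/MSE choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

def pretrain_layer(data, in_dim, hid_dim, epochs=10, lr=1e-3):
    """Train one autoencoder layer; return its encoder and the encoded data."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        h = torch.sigmoid(enc(data))
        recon = torch.sigmoid(dec(h))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return enc, torch.sigmoid(enc(data)).detach()

# Greedy layer-wise pretraining: each layer is trained on the previous layer's codes
x = torch.rand(64, 100)             # toy data; dimensions are arbitrary assumptions
enc1, h1 = pretrain_layer(x, 100, 50)
enc2, h2 = pretrain_layer(h1, 50, 25)
# enc1, enc2 now initialize a 2-layer stacked autoencoder for end-to-end fine-tuning
```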
Visually Grounded Autoencoders Train text autoencoder and image autoencoder with two hidden layers separately
Visually Grounded Autoencoders Feed their respective encoding as input to obtain the bimodal encoding
Visually Grounded Autoencoders Fine-tune the whole model with the global reconstruction loss, plus label prediction as a supervised signal. The bimodal encoding is used as the final word representation
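A rough sketch of the bimodal architecture described in the last three slides (PyTorch; only one hidden layer per modality is shown for brevity, and the layer sizes, attribute dimensions, classification head, and loss weighting are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BimodalSAE(nn.Module):
    """Textual and visual attribute vectors -> shared bimodal code -> reconstructions + label."""
    def __init__(self, text_dim, img_dim, uni_dim=100, bi_dim=100, n_classes=10):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, uni_dim), nn.Sigmoid())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, uni_dim), nn.Sigmoid())
        self.bi_enc = nn.Sequential(nn.Linear(2 * uni_dim, bi_dim), nn.Sigmoid())
        self.bi_dec = nn.Sequential(nn.Linear(bi_dim, 2 * uni_dim), nn.Sigmoid())
        self.text_dec = nn.Linear(uni_dim, text_dim)
        self.img_dec = nn.Linear(uni_dim, img_dim)
        self.classifier = nn.Linear(bi_dim, n_classes)   # label prediction used during fine-tuning

    def forward(self, x_text, x_img):
        h = self.bi_enc(torch.cat([self.text_enc(x_text), self.img_enc(x_img)], dim=-1))
        ht, hi = self.bi_dec(h).chunk(2, dim=-1)
        return h, self.text_dec(ht), self.img_dec(hi), self.classifier(h)

# h is the bimodal encoding used as the word representation; fine-tuning minimizes a
# global reconstruction loss on (rec_text, rec_img) plus a classification loss on logits.
model = BimodalSAE(text_dim=300, img_dim=400)            # attribute dimensions are placeholders
h, rec_text, rec_img, logits = model(torch.rand(8, 300), torch.rand(8, 400))
```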
Some training details: ● Weights of each AE are tied: the decoder reuses the transposed encoder weights (W′ = Wᵀ) ● Denoising for the image modality: treat x itself as the “corrupted” input, and take as the “clean” target the centroid of the embeddings of multiple images containing that object
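A small sketch of these two details (PyTorch; variable names, dimensions, and the mean-as-centroid choice are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W = nn.Parameter(torch.randn(100, 414) * 0.01)   # single shared weight matrix (tied)
b_enc, b_dec = nn.Parameter(torch.zeros(100)), nn.Parameter(torch.zeros(414))

def tied_autoencode(x):
    h = torch.sigmoid(F.linear(x, W, b_enc))          # encoder uses W
    return torch.sigmoid(F.linear(h, W.t(), b_dec))   # decoder uses W transposed (tied)

# Image-modality denoising: each per-image vector is the "corrupted" input,
# the centroid over all images of the concept is the "clean" target.
imgs = torch.rand(37, 414)                 # visual vectors of the images for one concept
clean_target = imgs.mean(dim=0)            # centroid
loss = F.mse_loss(tied_autoencode(imgs), clean_target.expand_as(imgs))
```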
Constructing visual & textual attribute representations In our case, the two modalities are unified in their representation by natural language attributes (in vector form). Goal: allow generalization to new instances for which no training examples are available
Constructing visual attribute representation ● VISA dataset, built from the McRae feature norms and images from ImageNet that depict McRae’s concepts ● McRae feature norms: a collection of concepts with vectorized representations, where each entry corresponds to a property of that concept ● 541 concepts represented by 700K images; each concept has between 5 (“prune”) and 2,149 (“closet”) images
Constructing visual attribute representation The concepts and attributes essentially form a bipartite graph
Constructing visual attribute representation Train an SVM classifier for each attribute: images with that attribute serve as positive examples and the rest as negative examples, operating on images processed by a hand-crafted feature extractor. A concept’s visual attribute representation is the average of the visual vectors of all its images
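A minimal sketch of this per-attribute classifier setup (scikit-learn; the random features, data shapes, and the use of decision scores as vector entries are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# img_feats: (n_images, d) hand-crafted features; attr_labels: (n_images, n_attrs) 0/1
img_feats = np.random.rand(500, 128)
attr_labels = np.random.randint(0, 2, size=(500, 10))

# One binary SVM per attribute
svms = [LinearSVC().fit(img_feats, attr_labels[:, a]) for a in range(attr_labels.shape[1])]

def image_visual_vector(feat):
    """Per-image visual attribute vector: one score per attribute classifier."""
    return np.array([svm.decision_function(feat[None, :])[0] for svm in svms])

# A concept's visual representation: average over the vectors of its images
concept_image_ids = [3, 17, 42]    # hypothetical image indices for one concept
concept_vec = np.mean([image_visual_vector(img_feats[i]) for i in concept_image_ids], axis=0)
```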
Constructing textual attribute representation ● Text attributes obtained from Strudel, “a distributional model akin to other vector-based models except that collocates of a concept are established by relations to other concepts interpreted as properties” ● Vector representation: each element represents the strength of association of a concept–attribute pair ● Attributes are automatically discovered by the Strudel algorithm from the training corpus ● Word embeddings from a continuous skip-gram model are also used for comparison
Visual Attributes & Textual Attributes Comparison
Experiments: Similarity and Categorization
Experiment 1: Similarity ● Create dataset by pairing concrete McRae nouns and assigning semantic and visual similarity scores from 1 to 5
Comparison Models ● Compare against ○ Kernelized canonical correlation analysis (kCCA) ○ Deep canonical correlation analysis (DCCA) ○ SVD projection ○ Bimodal SAE ○ Unimodal SAE
CCA ● Find projections that maximally correlate two random vectors ○ CCA: linear projections ○ kCCA: nonlinear projections ○ DCCA: nonlinear deep projections ● Vectors ○ X1: textual attributes ○ X2: visual attributes
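A minimal sketch of the linear CCA baseline (scikit-learn; the dimensions and the choice of 50 components are arbitrary assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

X1 = np.random.rand(541, 200)   # textual attribute vectors (one row per concept)
X2 = np.random.rand(541, 100)   # visual attribute vectors

cca = CCA(n_components=50)
cca.fit(X1, X2)
T1, T2 = cca.transform(X1, X2)  # maximally correlated projections of the two views
# A bimodal representation can then be formed, e.g. by concatenating T1 and T2.
```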
SVD on tAttrib + vAttrib ● Create a matrix of all objects’ textual + visual attributes (one object per row) ○ Compute the SVD ○ Use the right singular vectors to project the attributes
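A minimal sketch of this SVD projection (NumPy; the attribute dimensions and the target rank k = 100 are arbitrary assumptions):

```python
import numpy as np

tAttrib = np.random.rand(541, 200)    # textual attributes, one concept per row
vAttrib = np.random.rand(541, 100)    # visual attributes

M = np.hstack([tAttrib, vAttrib])     # each row: one object's textual + visual attributes
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 100                               # target rank
proj = M @ Vt[:k].T                   # project objects via the top-k right singular vectors
```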
Bruni’s SVD ● Collect a textual co-occurrence matrix from ukWaC and WaCkypedia ● Collect visual information through SIFT bag-of-visual-words on ESP ● Harvest text-based and image-based semantic vectors for target words ● Concatenate textual and visual information by row ● Take the SVD and project objects to lower-rank approximations
Similarity Metric ● Take the cosine of the angle between vectors ● Calculate the correlation coefficient against human similarity judgments
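A minimal sketch of this evaluation (NumPy/SciPy; Spearman's ρ is assumed as the correlation coefficient, and the word vectors, pairs, and ratings below are hypothetical toy data):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# word_vecs: concept -> representation; pairs and human_scores come from the similarity dataset
word_vecs = {"apple": np.random.rand(100), "pear": np.random.rand(100), "car": np.random.rand(100)}
pairs = [("apple", "pear"), ("apple", "car"), ("pear", "car")]
human_scores = [4.5, 1.2, 1.0]            # hypothetical 1-5 similarity ratings

model_scores = [cosine(word_vecs[a], word_vecs[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```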
Similarity Results ● Bimodal SAE (skip-gram, vAttrib) gets the best results ○ Semantic: 0.77 ○ Visual: 0.66
Experiment 2: Categorization ● Unimodal classifiers exist – ResNet ● Can bimodal classifiers perform better? ● Use the same comparison models
Experiment 2: Categorization ● Create a graph ○ node: object ○ edge: semantic or visual similarity weight ● Use Chinese Whispers algorithm
Chinese Whispers ● Gets its name from the ‘Chinese Whispers’ or ‘telephone’ game ○ The original message is iteratively distorted ● Nodes iteratively take on the class with the maximum weight in their neighborhood
Chinese Whispers
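A minimal sketch of the Chinese Whispers update described above (pure Python; the graph representation, iteration count, and toy graph are illustrative assumptions):

```python
import random
from collections import defaultdict

def chinese_whispers(edges, n_iters=20, seed=0):
    """edges: dict mapping node -> {neighbor: similarity weight}. Returns node -> cluster label."""
    rng = random.Random(seed)
    labels = {node: node for node in edges}          # start with one class per node
    nodes = list(edges)
    for _ in range(n_iters):
        rng.shuffle(nodes)                           # visit nodes in random order
        for node in nodes:
            weight_per_class = defaultdict(float)
            for nbr, w in edges[node].items():
                weight_per_class[labels[nbr]] += w
            if weight_per_class:                     # adopt the strongest class in the neighborhood
                labels[node] = max(weight_per_class, key=weight_per_class.get)
    return labels

# Toy similarity graph: two tight groups weakly connected
g = {"apple": {"pear": 0.9, "car": 0.1}, "pear": {"apple": 0.9},
     "car": {"truck": 0.8, "apple": 0.1}, "truck": {"car": 0.8}}
print(chinese_whispers(g))
```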
Categorization Results ● Compare classifications against AMT categories ○ Use F-score – harmonic mean of precision and recall ● Variables ○ s: class ○ n: size ○ h: cluster
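The slide's formula did not survive extraction; below is a standard clustering F-score consistent with the variables above (this exact form is an assumption, not copied from the paper):

```latex
% Precision/recall of cluster h with respect to class s, where n_{sh} is the number of
% items of class s in cluster h, n_h the cluster size, and n_s the class size:
P(s, h) = \frac{n_{sh}}{n_h}, \qquad R(s, h) = \frac{n_{sh}}{n_s}, \qquad
F(s, h) = \frac{2\,P(s,h)\,R(s,h)}{P(s,h) + R(s,h)}
% Overall score: weighted best match over clusters, with n the total number of items
F = \sum_{s} \frac{n_s}{n} \max_{h} F(s, h)
```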
Categorization Results ● Bimodal SAE (skip-gram, vAttrib) performs best ○ F-score of 0.48
Conclusion ● Bimodal SAE performs ‘inductive inference’ ○ Induce representation for missing mode by learning statistical redundancy between modes ○ Predict reasonable, interpretable textual attributes given visual attributes ■ jellyfish: swim, fish, ocean ■ currant: fruit, ripe, cultivate
Critiques ● Evaluation mainly on unimodal or bimodal data ○ No evaluation of inductive inference ● Skip-gram + vAttrib outperforms tAttrib + vAttrib ● Visual attributes are bootstrapped with SVMs ○ This bottlenecks SAE performance on SVM performance
Critiques ● Textual “attributes” and visual attributes do not share the same vocabulary -- intentional or a mistake? Are these really textual attributes, or just co-occurring words? ● Why not extract image features using neural networks instead of a specialized feature extractor? ● Data quality & size -- two authors manually inspected the concept-visual attribute relations, with only 500+ concepts ● If this is only a proof of concept -- can it generalize well to abstract concepts?